Abstract

This thesis addresses automatic lexical error recovery and tokenization of
corrupt text input. We propose a technique that can automatically correct
misspellings, segmentation errors and real-word errors in a unified framework that
uses both a model of language production and a model of the typing behavior, 
and which makes tokenization part of the recovery process. 

The typing process is modeled as a noisy channel where Hidden Markov 
Models are used to model the channel characteristics. Weak statistical language 
models are used to predict what sentences are likely to be transmitted through 
the channel. These components are held together in the Token Passing frame- 
work which provides the desired tight coupling between orthographic pattern 
matching and linguistic expectation. 

The system, CTR (Connected Text Recognition), has been tested on two
corpora derived from two different applications, a natural language dialogue
system and a transcription typing scenario. Experiments show that CTR can
automatically correct a considerable portion of the errors in the test sets without
introducing too much noise. The segmentation error correction rate is virtually
faultless.
Chapter 1

Introduction 



Practical Natural Language Processing (NLP) systems cannot expect all the
input to conform to the grammars encoded in them. An NLP system that is 
able to handle input that in some way deviates from the language defined by the 
grammar encoded in the system is called robust. The unexpectedness of the input
is due to the input being ungrammatical or extragrammatical. Ungrammatical
input is judged by humans as being erroneous or strange in some way whereas
extragrammatical input contains no errors; the input entered to the
system simply happens to lie outside of the grammar's coverage. The distinction is not
unimportant, e.g. on the lexical level, it would be a great help to know whether 
an unrecognized string is a misspelling or a correctly spelled unknown word. 
The problem of distinguishing the two cases is, however, in general impossible 
to solve. In theory it is possible to write a grammar and a lexicon that fully 
cover a particular language and exclude everything that is not in the language, 
so researchers in robust natural language processing tend to adopt the view that 
unparsable input is ungrammatical. 

Robustness is called for in all modes of language communication and in 
virtually all applications that one can think of. The traditional NLP applica- 
tions with machine-readable texts and keyboard-entered input include machine 
translation, information retrieval, grammar/style checkers, text/code editing, 
and other NLI (Natural Language Interface) applications such as computer- 
aided authoring, computer-based language learning/tutoring and NL dialogue 
systems. Speech processing applications require robustness, especially speech 
recognition (speech-to-text). Pen-based interfaces (handwriting readers) and 
optical character recognition (OCR) devices have to have their output further 
processed to improve recognition performance. The latter cases, where a media 
shift (recognition) takes place, are particularly troublesome since the machinery 
performing the recognition introduces errors which add to the human-generated 
errors that were already in the first medium. 

The action taken by the robust text processor in the face of ill-formedness
of course varies from application to application. A grammar- or style-checker 
may highlight a portion of the input and suggest a better way to formulate 
the particular passage. A computer-based language learning system should be
equipped with good diagnostic abilities so that whatever is fed back to the
learner is informative and relevant. An NL dialogue system should be able 
to perform simpler corrections to the input and enter into clarification sub- 
dialogues when more serious conditions arise in order to make the dialogue flow 
more naturally and minimize interruptions. 

The development of the robust text processing techniques presented in this 
thesis takes as a starting-point keyboard-entered input to an NL dialogue
system. An examination of a dialogue corpus showed that lexical errors are
more urgent than other error types when the overall goal is to make as many
of the user's inputs as possible interpretable. Furthermore, the majority of
the lexical errors can be automatically corrected, which means that the
user would not be bothered by simple typing mistakes. The scope of this thesis 
is thus the automatic correction of lexical errors, and although the techniques 
presented here are not limited to text-based dialogue systems, this application 
is the one assumed. 

The lexical errors are broadly divided into the two major error categories of 
misspellings and segmentation errors. 

==> What is the maintenance-cost for the respective models in the aboue table (1)

Utterance 1 (1) displays a typical misspelling (the highlighted 'aboue'). It is
an example of a so-called nonword misspelling, i.e. the erroneous token is not to
be found in the system's vocabulary. The majority of the techniques for auto- 
matic spelling error correction developed over the years deal with this error type 
only. It is possible to form a correction hypothesis for 'aboue' by comparing
it to the valid words in the vocabulary. This approach is called isolated word 
error correction and utilizes lexical information (the vocabulary) only. Lexi- 
cal information is, of course, of paramount importance when recovering from 
lexical errors, but in general the whole range of linguistic information is useful 
when correction hypotheses are generated, particularly syntax and to some
extent semantics. It is pretty obvious that the proper correction for 'aboue' in
utterance (1) is 'above', but without the use of contextual information there
is little evidence to distinguish this hypothesis from, say, 'about'. Whereas
syntactic information is useful for handling close calls as in this example, it is 
absolutely crucial when dealing with so-called real-word misspellings. 

==> show price for volvo 300 from the rear 1988 (2) 



1 Some of the example utterances in this and subsequent chapters originate from the
dialogue corpus mentioned above. (These utterances have the precursor ==>.) The corpus is in
Swedish, so the utterances here are literal translations where the crucial aspect (often a lexical
error) of the utterance has priority over good linguistic style. Utterances with the — > precursor
are not literal translations, but have been tampered with slightly, or simply invented to
better illustrate a particular phenomenon that is hard to translate literally. Hyphens that do
not wrap a line indicate that the Swedish source token is a noun compound, e.g. the Swedish
source token for 'maintenance-cost' is 'underhållskostnad'.



A word has been substituted for a token that is a valid word in the vocabulary. 
It is likely that the user in (2) intended to type 'year' but it accidentally came 
out 'rear'. The real-word errors are obviously harder to come to terms with 
than the nonword errors. Information other than lexical is necessary just to 
detect the problem spot. Few researchers have addressed the real-word error
problem, and some of these have tended to focus exclusively on this problem, 
forgetting the easier nonword error problem. 

The other major error category is the segmentation error category. Segmen- 
tation errors somehow involve word boundaries. There are two types: run-ons 
and splits. 

==> finally, can I have a list of these cars with information onspaciousness (3)

==> just the ones with coupe space 3-4 (4) 

Utterances (3) and (4) illustrate a run-on and a split respectively. In a run-on 
two (or more) words have been run together into a single token. In a split one 
word has been split into two (or more) tokens. The split in (4) should have 
been written 'coupe-space' 2 . As is evident from utterances (3) and (4), the
real-word/nonword error distinction applies to run-ons and splits as well as to
misspellings. Errors involving word boundary infractions are more difficult to 
handle than those that do not. Virtually all systems that process text in any way 
rely on a tokenizer to split the text up into word tokens, where the assumption 
is that a sequence of characters surrounded by space characters corresponds 
to a word. When this assumption is violated things go awfully wrong since an 
unknown token, like 'onspaciousness' in utterance (3) for example, is assumed
to be a misspelling of a single word. Furthermore, a segmentation error can 
generally be 'repaired' in more ways than can a misspelling and in general more 
linguistic information is needed to distinguish the good hypotheses from the bad 
ones. Recovering from segmentation errors is obviously hard, and this is one of 
the reasons why this problem has scarcely been addressed at all before. 

This thesis aspires to take a unified approach to the entire range of lexical
errors, utilizing syntactic and to some extent semantic information in addition 
to the essential lexical knowledge source. This has not been done before and 
it requires in particular that correction hypotheses pertaining to different error 
types can be compared for discrimination in a meaningful way. 

Inspiration can be gained from Connected Speech Recognition. Consider the 
fictional utterance below where especially the segmentation problem has been 
exaggerated ad absurdum.

— > Whtisthemaitneancecostofrtherespetivemdelsintheaboetalbe (5) 

Finding the words in the heavily distorted (5) is in many ways similar to finding 
(phonemes and) words in continuous speech. There is no indication in continu- 
ous speech as regards the end of one word and the start of the next, i.e. there 

2 The Swedish source is 'kupe utrymme', which is not the same as 'kupeutrymme'. 
is no space character counterpart in speech. Furthermore, since the 'alphabet'
in the speech case is made up of an infinite number of 'characters' (real-valued 
feature vectors), a word is never 'spelled' the same way twice. The point made 
here is that if we agree to carry the error types of text processing over to speech 
processing, we see that speech is virtually littered with misspellings and
segmentation errors. It is therefore natural to examine what the methods used
in the difficult speech recognition task can offer for the relatively simple text
recognition problem 3 .

The following chapter describes and delineates the problem domain and in- 
troduces the terminology used in this thesis. Chapter 3 "Error Profile of a 
Dialogue Corpus" describes the types and frequencies of errors found in a Nat- 
ural Language dialogue corpus. Chapter 4 gives a brief review of prior work in 
the area of lexical error correction. Chapter 5 "An Algorithm for Robust Text 
Recognition" holds the technical contributions of this thesis. It describes prob- 
abilistic methods used in speech processing, and how they can be incorporated, 
adapted and put to use for the present task. Chapter 6 "Experimental Eval- 
uation" evaluates the techniques described in Chapter 5 on two error corpora 
extracted from two different applications, or scenarios. The first is a dialogue 
application and the other is a transcription typing scenario. The thesis
concludes with Chapter 7 "Future Work".



3 One plausible interpretation of (5) could be (1) (with 'aboue' substituted for 'above'). 



Chapter 2 

The Lexical Error Problem 



A problem faced by the analysis component of any NLP-system is that the 
input sometimes does not conform to the expectations of the system's develop- 
ers. One of the reasons for this may be that the input is erroneous, ill-formed. 
In this situation the system needs to react in some way, and the proper reac- 
tion depends, amongst other things, on the application at hand and the type 
of error encountered. Errors that occur in natural language text can violate 
linguistic expectations on all levels: the lexical, syntactic, semantic and the 
pragmatic/discourse level. The class of lexical errors is the one that has been 
most thoroughly studied, and in some respects the least problematic one. In 
many cases it is possible to guess at what the user intended to write and hence 
corrections can be automatically proposed. Some of the syntactic error types 
can be corrected with varying degree of success, but in general this error type 
calls for alternative reactions on the system's part. Semantic and pragmatic 
errors are the hardest. 

In this thesis we are primarily concerned with the lexical error problem. 
We look at this problem from the perspective of a Natural Language Dialogue 
system, from which we have gathered a dialogue corpus. The study of this 
corpus (presented in Chapter 3) involves not only the lexical errors but also 
certain syntactic problems. Thus the lexical error problem formulation of this 
chapter in Sections 2.1 and 2.2 is supplemented with Section 2.3 "Syntactic 
Errors" to provide the taxonomy of syntactic errors used in Chapter 3. Semantic 
and pragmatic errors are not addressed here. Veronis [1991] gives a relatively 
comprehensive account of semantic problems and Carberry [1984] describes some 
of the pragmatically ill-formed phenomena that can appear in the context of a 
dialogue system. 



2.1 Misspellings 

Any misspelling can be described as a transformation from a correctly spelled 
word performed by one or several of the basic error operations: 



• deletion (e.g. 'deltion')

• insertion (e.g. 'innsertion')

• substitution (e.g. 'subatitution')

• transposition (e.g. 'trasnposition')



The basic error operations are not primitive operations, nor do they provide 
a unique path from word to misspelling. Rather, their usefulness lies in their 
correspondence to real world error-creating operations and their ability to in- 
terconvert any pair of strings. The basic error operations can be used to de- 
scribe lexical errors but they are not very good at explaining them. Distinctions 
are made between typographic errors (performance errors), cognitive errors and 
phonetic errors (competence errors). In the case of typographic errors it is 
assumed that the typist knows the correct spelling of the word but makes a 
simple motor coordination slip. The substitution of 'ten' for 'the' is a typical 
typographic error. In the case of cognitive errors there is a lack of linguistic
competence on the part of the typist. An example of a cognitive error:
'receive' -> 'recieve'. A phonetic error, which is really a sub-class of the
cognitive errors, is a word that is phonetically correct but orthographically
incorrect ('memories' -> 'memerys'). Although the distinctions are useful for
sorting out human spelling behavior, they are seldom used when it comes to 
designing a spelling correcting program. The reason is that it is generally quite 
hard to determine the underlying cause of the error; 'recieve' for example 
may just as well be an accidental transposition error, and even if the cause can 
be established it is not certain that it will help the spelling corrector. Another, 
more important distinction is the one between nonword errors and real-word 
errors. 



• A nonword misspelling is one that results in a string not in the vocabulary
(of the system).

• A real-word misspelling is one where a valid (correctly spelled) word is
substituted for the intended word.



The substitution of 'wether' for 'whether' in 'wether to be or not to be
...' is an example of a real-word error. By definition a real-word error cannot
be detected by the use of a system's lexical knowledge. A syntactic analysis may
or may not detect the error depending on the relationships between the words. 
Apart from detecting real-word errors, the context is often helpful in deciding 
amongst alternative correction candidates for nonword errors. The traditional 
way of correcting the detected nonword error usually involves some sort of dis- 
tance metric, cf. [Kukich, 1990]. The metric is used to compare the erroneous 
token to the valid words in the dictionary, and to choose the word that is closest 
to the misspelling. The problem with many of these algorithms is that in many 
cases there are several candidates that are equally 'close'. In 'pass me the 
salr please', the erroneous 'salr' should probably be 'salt', but viewed 
in isolation 'sale' is just as likely. In these situations contextual dependencies
are useful. 
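
The tie between equally close candidates can be illustrated with a small sketch. The code below is not taken from any of the systems cited here; it assumes one common choice of metric, the restricted Damerau-Levenshtein (optimal string alignment) distance, which charges one unit for each of the four basic error operations. The names osa_distance, closest_words and TOY_VOCABULARY are hypothetical, and the vocabulary is a toy one chosen for the example.

```python
def osa_distance(a: str, b: str) -> int:
    """Minimum number of deletions, insertions, substitutions and
    adjacent transpositions needed to turn a into b (restricted variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i characters of a
    for j in range(cols):
        d[0][j] = j          # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. 'sn' <-> 'ns'
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def closest_words(token: str, vocabulary) -> list:
    """All vocabulary words at minimal distance from token (ties are kept)."""
    scored = sorted((osa_distance(token, word), word) for word in vocabulary)
    best = scored[0][0]
    return [word for dist, word in scored if dist == best]


# Toy vocabulary, an assumption made for this illustration only.
TOY_VOCABULARY = ["salt", "sale", "pass", "please", "the", "me"]
```

Calling closest_words("salr", TOY_VOCABULARY) returns both 'sale' and 'salt', each one substitution away from the erroneous token, so the metric alone cannot choose between them.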

The example of 'wether' above hints at a problem related to spelling errors,
namely the unknown word problem. In most NLP-systems 'wether' would
not be part of the vocabulary, and hence it would not be a real-word error but
rather a nonword error. In this particular example it causes no problem since
the intended word was 'whether', but if 'wether' was the intended word, as
in 'Is that our wether grazing over there?', problems arise. The tricky
bit is to decide whether 'wether' is a misspelling of a known word or
a correctly spelled word that just happens to lie outside of the dictionary's
coverage. Extragrammatical problems such as the unknown word problem are not
further addressed in this thesis, except in the discussion in Chapter 7 "Future
Work".

Another distinction that is often made between misspellings is that of single 
error misspellings and multiple error misspellings. 

• A single error misspelling is an error where one of the four basic error 
types has occurred exactly once 

and consequently 

• A multiple error misspelling contains more than one instance of the four 
basic error types. 

The distinction may seem a bit strange, but it was discovered already in the 
sixties (see below) that a large portion of all spelling errors was due to single 
error misspellings, and thus many of the techniques for automatic spelling error 
correction developed over the years have focused on this error type. If for no 
other reason, this makes the distinction interesting for comparative purposes. 

When examining studies of spelling error corpora it is important to notice 
that spelling error frequencies and error patterns vary significantly between 
different applications. 

In an early study Damerau [1964] found that as many as 80% of the words 
rejected by the list of acceptable terms in an information retrieval system were 
single error misspellings. The term-list will only reject nonword errors of course, 
so the remaining 20% are presumably multiple error misspellings (Damerau does 
not mention segmentation errors). 

Kukich [1992a] made an error profile of a 40,000 word TND 1 (Telecommuni- 
cations Network for the Deaf) transcript corpus. She found 78% of the nonword 



1 TND is a service that AT&T provides for their speech- and hearing-impaired customers.
A person can have a TDD (Telecommunications Device for the Deaf) device with which she 
can communicate with other people with speech- and/or hearing-impairments who also have 
a TDD. A TDD is much like a terminal that can be hooked up to the wire via a modem, and 
lets the user type messages on a keyboard and receive messages on a screen. A TDD user 
can also communicate with a voice phone user by calling a deaf relay center, where a relay 
operator, using both a TDD and a voice phone, reads the text typed by the TDD user to the 
voice phone user and listens to the voice phone user's response and types the words spoken 
by the voice phone user back to the TDD user. 
errors in the corpus to be single error misspellings, corroborating the findings of
Damerau. Some 18% of the misspellings contained a mistake in the first char- 
acter position. It is generally believed that errors tend not to occur in the first 
character position, and the figure may be unrepresentatively high. Twenty-seven
percent contained errors involving an adjacent key. The layout of the keyboard 
is, of course, relevant when motor coordination slips occur. Two percent were 
phonetic errors: 'cuz', 'becuz', 'u', 'ur' and 'rite'. Phonetic error is per- 
haps not the most adequate characterization for these words. The slang-like 
abbreviations can be seen in informal conversations over the net, and unknown 
word may be a better term. 

Pollock and Zamora [1983] collected over 50,000 nonword misspellings from 
around 25,000,000 words of text from a number of different scientific and schol- 
arly databases. They found, amongst other things, that the multiple error 
frequency was as low as 7.5%. The rest of the nonword misspellings were dis- 
tributed over the four basic error types: deletions 34%, insertions 27%, sub- 
stitutions 19% and transpositions 12.5%. The low error frequency (0.2%) is 
probably partly due to the nature of the text source. 

Mitton [1987] studied the spelling behavior of secondary school pupils (15 
years of age) in a corpus of short essays on the subject of "Memories of my 
primary school". The essays were handwritten. The corpus included 924 essays
and a total of 170,016 words. Mitton recorded 4,218 errors, an error rate
of only 0.25%. The low error frequency can partly be explained by the fact 
that the essays were written by hand, thus excluding the keyboard error-source. 
It is not clear from Mitton's report, but it appears as though the pupils were 
aware of the fact that they were being tested on their spelling skills. This may 
also have had an effect on the error frequency (perhaps in either direction).
Mitton reports a relatively high percentage of real-word errors. He found 40% to
be real-word errors and consequently 60% nonword errors. In the real-word 
error category Mitton includes errors that are generally considered to be syn- 
tactic errors, such as present-tense verbs in place of past ('the best thing 
I like was to play') and number agreement errors ('five other primary
school'). Excluding these errors from the real-word error category, the
percentage is still as high as 30%. When a system's (incomplete) dictionary is used
there will be a smaller amount of real-word errors and instead a corresponding 
amount of unknown words. Another remarkable finding in Mitton's report is
the unprecedentedly high frequency of phonetic errors. Homophones and
near-homophones made up close to 60% of the errors in the corpus. Near-homophones
are words like 'where' and 'were', which are homophones to some people, and
'have' and 'of', which can be homophones in 'I might of done'. The high
frequency of homophones is hard to explain, but perhaps it has to do with the
fact that the pupils were given a spelling transcription test before they were
asked to write the essay. This might have put the pupils in a "sound-to-text
spelling mode", which might explain the high homophone frequency.

Peterson [1986] set out to determine the probability of a word being mistyped 
as another word, as a function of the size of the word-list. With the terminology 
introduced in this section this means: What is the probability that a single basic 
typographic error will result in a real-word error when the size of the word-list
varies? Due to lack of reliable statistics the four basic error types were assumed
to be equally likely. For an n-letter word and 28 characters in the alphabet (26
letters, hyphen and apostrophe) there are n + 28(n+1) + 27n + (n-1) = 57n + 27
possible single error misspellings 2 . A word-list of 369,546 words was run through
a program that generated all the possible mistypings for each word. Of the
possible 205,480,845 mistypings 988,192, or 0.5%, turned out to be another
valid word. Variably sized word-lists were constructed by stripping off the most
infrequent words from the original word-list. The smallest word-list would thus
contain only the most frequent words, which are generally shorter words: 'the',
'of', 'and', 'to', 'a', 'in', 'is', 'that', 'it' and 'he' make up the top
ten. The word frequencies were estimated from different corpora. Short words
are more likely to be real-word errors if mistyped than longer words. By weighting
the fraction of all possible errors that are real-word errors against the expected
frequency of occurrence, and varying the size of the word-list, Peterson found 
that in running text the expected frequency of single error real-word errors 
caused by typographic mistakes could be as high as 16%. The effect of the more 
frequent shorter words is apparent in that the 100,000 most frequent words 
resulted in 13% real-word error probability, while the remaining 300,000 words 
only added a further 3%. 
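
The count of 57n + 27 single error mistypings can be checked with a short sketch. This is not Peterson's program; it is a minimal reconstruction under the assumptions stated in the text (a 28-character alphabet, each basic operation applied exactly once, results counted with multiplicity). The names ALPHABET, single_error_mistypings and real_word_fraction are hypothetical, and words are assumed to be lower-case and to use only the 28 characters.

```python
import string

# 26 letters plus hyphen and apostrophe, as assumed in the text.
ALPHABET = string.ascii_lowercase + "-'"


def single_error_mistypings(word: str) -> list:
    """Every string obtained by applying one basic error operation once.
    Duplicates are kept, so the length equals 57n + 27 for an n-letter word."""
    n = len(word)
    out = []
    # deletions: n
    out += [word[:i] + word[i + 1:] for i in range(n)]
    # insertions: 28(n + 1)
    out += [word[:i] + c + word[i:] for i in range(n + 1) for c in ALPHABET]
    # substitutions: 27n (the original character is excluded)
    out += [word[:i] + c + word[i + 1:]
            for i in range(n) for c in ALPHABET if c != word[i]]
    # transpositions of adjacent characters: n - 1
    out += [word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(n - 1)]
    return out


def real_word_fraction(word_list) -> float:
    """Fraction of all generated mistypings that are another valid word."""
    vocabulary = set(word_list)
    total = hits = 0
    for word in word_list:
        for mistyping in single_error_mistypings(word):
            total += 1
            if mistyping in vocabulary and mistyping != word:
                hits += 1
    return hits / total
```

For the three-letter word 'the' the function generates 57 * 3 + 27 = 198 mistypings. Run over a full word-list, real_word_fraction gives the unweighted fraction that corresponds to the 0.5% figure above; the weighting by word frequency is not part of this sketch.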

2.2 Segmentation Errors 

Segmentation errors are errors in which word boundary markers are involved. 
The most prominent of the boundary markers is the space character, whose sole 
objective is to delimit words. The space character is a member of the white- 
space characters which also include tab, carriage return and line-feed. These 
characters also have other functions besides delimiting words, and the same goes 
for parentheses, period, comma, slash etc.

The basic error operations of deletion, insertion, substitution and transpo- 
sition when applied to word boundaries result in things like: 

— > He gaveher roses (6) 

— > He ga ve her roses (7) 

— > He gavehher ro es (8) 

— > He gav eher roses (9) 

The segmentation errors in (6) through (9) are more or less likely to appear 
in actual texts. In general (6) and (7) are more common than (8) and (9) 
since they are not only caused by accidents, but also arise from cognitive and 
phonetic misconceptions. Segmentation errors can, just as regular misspellings, 
be diagnosed as cognitive, phonetic and typographic errors. The examples (6) 

2 Deletions, insertions, substitutions and transpositions respectively.
through (9) are obviously all typographic errors. The accidental substitution of
'h' for a space and a space for 's' in (8) is probably unlikely to occur very frequently
in actual texts because of the form and placement of the space-bar; it is unlikely 
that one accidentally hits the space-bar instead of one of the character keys or 
vice versa. Example (6) and the first erroneous token in (8) are called run-ons. 
Example (7), the second erroneous token in (8) and example (9) are called splits. 

• A run-on segmentation error has occurred when two or more words are 
written as one token, and it is a multiple run-on if more than two words 
are involved. 

• A split segmentation error has occurred when a word is written as two or 
more tokens, or, when two words are split up into two erroneous tokens, 
and it is a multiple split if more than two tokens are involved. 

Example (9) requires special consideration. From the point of view of the
basic error operations it is quite arbitrary whether (9) should be counted as a 
run-on or as a split, or as a combination of both, but since the result of the 
transposition is two tokens the best way to describe example (9) is as a split 
and hence the definition above. 

Run-ons and splits may, of course, like misspellings result in real-word errors. 
Particularly in the case of splits that are caused by cognitive and/or phonetic 
misconceptions it is quite likely that at least one of the tokens is a real word 
('already' -> 'al ready').

• A real-word run-on has occurred when the token is a valid word in the 
vocabulary. 

• A real-word split has occurred when at least one of the affected tokens is 
a valid word in the vocabulary. 

The problem of detecting and correcting segmentation errors has received 
much less attention than that of misspellings. A reason for this is that segmen- 
tation errors are less frequent than spelling errors in naturally occurring texts. 
Another reason, and most likely the main reason, is the complexity inherent in 
the segmentation problem. In the case of segmentation errors context is not 
only helpful, as with most misspellings, but essential. 

Virtually all spelling correction techniques rely on a tokenizer to split the 
character input stream into tokens. Tokenizers are often simple programs that
just look for white-space characters and other delimiters to determine token 
boundaries. The assumption is that a token corresponds to a word in the vo- 
cabulary. When the input stream has been tokenized, each token is checked 
against a dictionary of valid words and if the token is not in the word-list, it 
is flagged as an error, i.e. an error involving one word. In actuality this means 
that the tokenizer defines the problem-spot. Assume for a moment that the to- 
kenizer is right in its assumption that the unrecognized token is the misspelling 
of exactly one word. As mentioned above, the traditional approach to spelling 
error correction relies on the computation of the distance between the
erroneous token and the words in the vocabulary. With an M word vocabulary,
M distances have to be computed. However, if the constraint enforced by the 
white-space during tokenization is relaxed, the error correction task becomes 
considerably more complex. Turning again to utterances (6) to (9), if a single 
erroneous token appears in the input stream it might be a run-on. If the token 
contains n characters it can be split in n ways and each splitting results in two 
candidate tokens. If any of the splittings results in two perfect matches, (utter- 
ance (6)) this yields a plausible correction. However, it is possible that perfect 
matches are unobtainable (the first erroneous token in utterance (8)). In this 
case both candidate tokens of the hypothetical run-on have to be compared to 
the M words of the vocabulary resulting in 2nM distance computations. The 
same line of reasoning can, of course, be applied to splits. The problem that 
faces the error recovery program is that it cannot make any assumptions
regarding the type of the error if it wants to be sure to find the best correction
alternative. When misspellings, run-ons and splits, along with single as well as 
multiple errors, and even combinations of misspellings and segmentation errors, 
are considered, the traditional approach to lexical error recovery has lost its 
applicability. This is most likely the main reason why so little interest has been 
shown in segmentation errors. 
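
The growth in the number of comparisons for a hypothetical run-on can be made concrete with a small sketch. It illustrates the counting argument only and is not the method developed in this thesis. The names split_candidates, exact_run_on_repairs and distance_computations are hypothetical, as is the toy vocabulary in the usage note. Note that a token of n characters has n - 1 interior cut points that leave both halves non-empty, which is of the same order as the n used in the text.

```python
def split_candidates(token: str) -> list:
    """Every way of cutting a token into two non-empty candidate tokens."""
    return [(token[:i], token[i:]) for i in range(1, len(token))]


def exact_run_on_repairs(token: str, vocabulary) -> list:
    """Splits where both halves are perfect matches in the vocabulary."""
    return [(left, right) for left, right in split_candidates(token)
            if left in vocabulary and right in vocabulary]


def distance_computations(token: str, vocabulary_size: int) -> int:
    """Number of distance computations needed when no perfect split exists:
    two candidate tokens per split, each compared to all M vocabulary words."""
    return 2 * len(split_candidates(token)) * vocabulary_size
```

With the toy vocabulary {'he', 'gave', 'her', 'roses'}, the token 'gaveher' from (6) has the single exact repair ('gave', 'her'), whereas 'gavehher' from (8) has none and would require 2 * 7 * 4 = 56 distance computations even for this four-word vocabulary.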

In addition to this there is the real-word segmentation error problem which 
means that legal words surrounding the unrecognized token(s) have to be taken
into account in the generation of corrections as well. Consider the pathological 
laboratory sentence: 

— > Her an together be fore shere ached these a (10) 

This mumbo-jumbo sentence contains one nonword, 'shere'; the rest of the
words are all legal. The problem of resegmenting (10) is quite hard, even for
humans. The white-space characters carry a significant amount of information 
and when they are misplaced, interpretation gets hard. Although the segmen- 
tation problem never gets as hard as in (10) in processing of actual texts, there 
are other applications that need to deal with similar situations. In speech recog- 
nition, OCR and handwriting decoding the segmentation problem is a primary 
concern. (In the two latter cases character segmentation is actually more acute 
than word segmentation.) Speech recognition suffers from problems of coartic- 
ulation, homophones and, of course, there is no space character counterpart in 
speech. A speech-like reformulation of (10) could be written as: 

— > Herantogetherbeforeshereachedthesea (11)

The segmentation of (10) and (11) that is likely to be the 'intended' one is: 

— > He ran to get her before she reached the sea (12) 
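
To illustrate how ambiguous an unsegmented string such as (11) is, the following sketch enumerates every way of cutting it into vocabulary words. The vocabulary is a small hand-picked assumption and the exhaustive recursion is only meant for illustration; it is not the technique proposed in this thesis.

```python
def segmentations(text, vocabulary):
    """Yield every split of `text` into a sequence of vocabulary words."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in vocabulary:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[i:], vocabulary):
                yield [head] + rest


# Hypothetical toy vocabulary covering the words of examples (10) and (12).
VOCABULARY = {"he", "her", "ran", "an", "to", "together", "get", "be", "fore",
              "before", "she", "reached", "ached", "the", "these", "sea", "a"}

readings = [" ".join(words)
            for words in segmentations("herantogetherbeforeshereachedthesea",
                                       VOCABULARY)]
for reading in readings:
    print(reading)
```

The intended reading (12) is among the results, but so are several competitors, so some form of linguistic expectation is needed to rank them.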

Segmentation problems are less frequent in texts than misspellings. In many 
applications, however, it is a problem that needs to be addressed. 
In her study of the nonword errors in the 40,000 word TND corpus Ku-
kich [1992a] found that 13% of the errors were run-ons and 2% were splits. 
Apparently the majority of the segmentation errors were caused by typographic 
slips such as 'yesthisis' and 'sp ent'. The investigation also indicates that 
a high percentage of the run-ons involve a relatively small set of high-frequency 
function words. 

In Mitton's [1987] study of errors in 15-year-olds' essays, segmentation errors 
were also considered. Contrary to Kukich, Mitton found splits to be more 
common than run-ons. Fourteen percent of the errors were splits but there were 
only 3% run-ons. Whereas it was very rare for the run-ons to result in a real
word, there were only five cases (0.12%) where a split word resulted in
neither of the halves being a real word. In 13% of the cases both halves were
real words and in the remaining cases (1%) one of the halves was a real word
(e.g. 'to gether' and 'evry body'). Mitton also found that the (generally
shorter) function words were more liable to result in real-word errors than the 
content words. The real-word error portion of the function words in error was 
66%, and the corresponding figure for the content words was 33%. 

2.3 Syntactic Errors 

The aim of this section is to introduce the error categories that are used to diag- 
nose the error corpus in Chapter 3, not to fully cover all the syntactic errors and 
peculiarities that can appear in texts. For a more in-depth and linguistically
oriented account of these issues see, for example, Baker et al. [1990],
Veronis [1991], Carbonell and Hayes [1983], Kwasny and Sondheimer [1981], and
Hayes and Mouradian [1981].

Syntactic errors depend on the structural relationships between words in a 
sentence. At the surface level however, a sentence is a sequence of words just as 
a word is a sequence of characters. From this standpoint it is natural to simply 
upgrade the basic lexical error operations to the syntactic level. This results in 
the basic syntactic error operations: 

• missing word e.g. 'the plays in the backyard' 

• extra word e.g. 'the the boy plays in the backyard' 

• substituted word e.g. 'the boy plays but the backyard' 

• transposed words e.g. 'the boy plays the in backyard' 
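
The four word-level operations listed above mirror the character-level ones, so the same dynamic programming scheme can be applied to word sequences. The sketch below is an assumption made for illustration (a restricted edit distance with adjacent transpositions, computed over whitespace-separated words); it is not a component of the system described in this thesis.

```python
def word_edit_distance(observed, intended):
    """Edit distance over words: missing, extra, substituted, transposed."""
    a, b = observed.split(), intended.split()
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # i extra words in the observed sentence
    for j in range(len(b) + 1):
        d[0][j] = j                      # j missing words
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # extra word
                          d[i][j - 1] + 1,          # missing word
                          d[i - 1][j - 1] + cost)   # substituted word (or match)
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposed words
    return d[-1][-1]


intended = "the boy plays in the backyard"
for observed in ["the plays in the backyard",
                 "the the boy plays in the backyard",
                 "the boy plays but the backyard",
                 "the boy plays the in backyard"]:
    print(observed, "->", word_edit_distance(observed, intended))
```

Each of the four example sentences is one basic operation away from the intended sentence.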

Just as any misspelled word can be transformed into a legal word by combi- 
nations of the basic lexical error operations, so can an ungrammatical sentence 
with the basic syntactic error operations. For practical purposes, however, this 
classification is too crude. In the typology used here the basic error operations 
will account primarily for performance errors. Most of the examples above are 
likely to be accidental, possibly with the exception of the substituted word error. 
A variant of the substituted word error, which is generally a competence error, 
is the agreement error. 
• Agreement errors include:

— subject-verb number and person errors, e.g. 'they plays in the 
backyard' 

— wrong case of pronouns, e.g. 'them play in the backyard' 

— noun phrase problems, e.g. 'they play in a backyards'

A subtle distinction has been introduced here regarding the real-word mis- 
spelling (see Section 2.1) and the substituted word error. Both error types are 
concerned with the substitution of one word for another and both are perfor- 
mance errors. The difference is that the real-word misspelling is a misspelling, 
i.e. it is an error on the character level. The substituted word error is something 
other than a misspelling and it is an error on the vocabulary level. Hopefully 
the confusion can be reduced by an example of a substituted word error: 

==> Is there a medium-sized-car in the 50000-70000 price-range
    and has year-of-production 1988? (13)

This strange looking utterance would read perfectly well if 'and' was replaced
with 'that' or 'which'. It is hard to say what the user was thinking about 
when she typed (13), but one thing is clear and that is that 'and' is not 'that' 
or 'which' misspelled. Thus (13) is an example of the substituted word error 
type. Utterance (2) in Chapter 1 is an example of a real-word error. 

A varied and wide-ranging phenomenon that is also quite frequent in the
corpus examined in the following chapter is the elliptic utterance. The elliptic
utterances are not really errors, since they are generally unproblematic to
understand in the context in which they appear, at least for humans. In some sense,
however, they are errors: they violate an imagined core grammar of how legal
declarative, imperative and interrogative sentences are formed in a language.
Henceforth these sentences will simply be referred to as elliptic, avoiding having
to classify them as being ill- or well-formed.

Various distinctions can be made between different types of ellipses (cf.
[Carbonell and Hayes, 1983, Lavelli and Stock, 1990]). For our purposes, however,
two types will suffice: the telegraphic ellipsis and the contextual ellipsis. In a 
telegraphic ellipsis one or, more often, several words have been intentionally left 
out. It is generally easy to state that a fragment has been left out on purpose, 
and this makes it easy to distinguish the telegraphic ellipsis from the missing 
word error type. Further, the elliptic fragment, the left-out portion, is always 
semantically redundant. Two examples of telegraphic ellipses, one at each end
of the scale: 

==> looking for other models in the same price-range that
    would not be expensive to maintain (14)

==> mercedes fuel-consumption (15) 

The subject and the copula have been left out in (14) and many people would 
not consider it an error. In (15) it is more obvious that fragments have been left 
out. However, in the dialogue context in which it appears, it is unproblematic
to interpret the input: 'What is the fuel-consumption for mercedes'.

In contextual ellipses the elliptic fragment is also left out intentionally, but
it is not semantically redundant; instead, it can be inferred from the context.
In a dialogue system, it can usually be inferred from one of the immediately preceding
utterances, which may be system or user generated. An example of a contextual 
ellipsis: 

==> What is the impact-safety for the cars with
    rust-protection better than 3 (16)

The system responds ... 

==> better than 4 (17) 

When the second utterance is entered, it is not hard to understand what
is meant: the portion 'What is the impact-safety for the cars
with rust-protection' should be put first in the second utterance to yield
the intended query.
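
The reconstruction described for (16) and (17) can be mimicked by a deliberately naive heuristic, sketched below purely as an illustration. The function name and the overlap rule (reuse everything in the previous utterance that precedes the last occurrence of the fragment's first word) are assumptions; they are not a method proposed or evaluated in this thesis.

```python
def expand_contextual_ellipsis(previous, fragment):
    """Prefix `fragment` with the part of `previous` it is assumed to replace."""
    prev_words, frag_words = previous.split(), fragment.split()
    if not frag_words:
        return fragment
    # Look, from the right, for the word that the fragment starts with.
    for i in range(len(prev_words) - 1, -1, -1):
        if prev_words[i] == frag_words[0]:
            return " ".join(prev_words[:i] + frag_words)
    return fragment  # no overlap found: leave the fragment untouched


previous = ("What is the impact-safety for the cars with "
            "rust-protection better than 3")
print(expand_contextual_ellipsis(previous, "better than 4"))
```

A heuristic like this breaks down as soon as the fragment shares no word with its antecedent, which is one reason why ellipsis resolution is normally left to the dialogue manager.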



Chapter 3 

Error Profile of a Dialogue 
Corpus 



The need for robust text processing techniques became apparent during the 
development of the natural language dialogue system LINLIN [Ahrenberg et al.,
1990, Jonsson, 1995]. LINLIN is a dialogue interface to a database, and it accepts
queries in natural language (Swedish) and produces SQL-queries that are fed to
the DBMS. LINLIN can be adapted to different application databases. In order to
try to determine the required functionality of the dialogue management module, 
a series of experiments was conducted using the Wizard of Oz data collection 
scheme, cf. Dahlback et al. [1993]. The idea behind the Wizard of Oz technique, 
in short, is to let an operator, a human being, perform the task of the dialogue 
interface. In this case the operator interprets the NL input, she then accesses 
the SQL-database and relays the database output to the subject, or she herself 
replies with canned text or manually types back a short reply. 

In the following two sections error profiles are given of two different corpora 
collected using the Wizard of Oz technique. The motivation for the study is 
to get an idea of the error frequencies in this sort of application. How are the 
errors distributed over the error types? What is the most 'urgent' robustness 
functionality? Which linguistic knowledge sources are in the foreground for 
resolving/interpreting the different types of ill-formedness?

The corpora (and the sections) are named CARS and TRAVEL. Slightly more
attention will be devoted to CARS since it is one of the corpora that have been 
used for evaluating the techniques presented in this thesis (see Chapter 6). In 
the CARS-application the database contains information on different car mod- 
els. The subject can retrieve information about the model's price, fuel con- 
sumption, top speed etc. The task given to the subject is to decide which car 
to buy given certain financial restrictions. The TRAVEL-application is a travel 
agency scenario. The subject's task is to decide on a charter trip to the Greek 
archipelago. She has to choose an island and a hotel, when to travel etc. The 
experimental setups differ slightly in the two cases. In CARS the operator reads 
the subject's input from the screen and constructs the corresponding SQL-query
that is submitted to the DBMS. The output from the database is simply for- 
warded to the subject's screen. The output is thus generally in table format. 
Occasionally the subject asks for clarification of table output, and sometimes 
the subject queries the 'system' for information that is not included, in which 
case the operator replies using canned text or hand-typed short messages. In 
TRAVEL no actual database system is involved; the operator simply manipulates
a large collection of canned texts organized in such a way that the operator can
simulate a database system interface. The TRAVEL interface also exhibits some
degree of multimodality in that the operator is able to display maps of many of 
the tourist locations. 

In both experimental setups and for all subjects the operator is instructed 
to be 'forgiving' with respect to ill-formed input, i.e. the operator will respond 
to all queries that she can understand. Note that the data collection was not 
carried out with ill-formed input as a topic of investigation. 



3.1 CARS 

The CARS corpus contains 20 dialogues gathered from 20 different subjects. Ten 
of the subjects were led to believe that they were communicating with an actual 
natural language interface, and the other ten were informed of the fact that the 
interface was simulated by an operator. As far as ill-formed input is concerned 
the two sub-corpora are quite similar. Below the sub-corpus collected with the 
misled subjects is called MACHINE, and the other half is called OPERATOR, the
distinction reflecting the subject's beliefs. The two sub-corpora are presented 
separately in the tables below for transparency. CARS contains 369 user utter- 
ances, 3,139 word tokens and 584 word types. Table 3.1 shows how many of the 
utterances are well-formed, elliptic and how many contain at least one error. 



                           Corpus
Utterances     MACHINE        OPERATOR       CARS
Well-formed    116    70%     132    65%     248    67%
Elliptic        13     8%      24    12%      37    10%
Ill-formed      37    22%      47    23%      84    23%
Total          166   100%     203   100%     369   100%

Table 3.1: The distribution of utterances



Of the 121 (37 + 84) ill-formed and elliptic utterances 94 (78%) contain 
one elliptic or erroneous construction and 27 (22%) contain more than one. 
In the 121 ill-formed and elliptic utterances there is a total of 161 individual 
errors/ellipses in CARS. The distribution over the error types is displayed in 
Table 3.2. 

                                       Corpus
Error Types                MACHINE       OPERATOR      CARS
Misspellings                22   31%      40   45%      62   39%
Run-ons                     14   19%       3    3%      17   11%
Splits                       7   10%       6    7%      13    8%
Lexical errors (Σ)          43   60%      49   55%      92   57%
Missing Constituent          4    6%       0    0%       4    2%
Extra Constituent            0    0%       0    0%       0    0%
Substituted Constituent      1    1%       4    4%       5    3%
Transposed Constituent       0    0%       0    0%       0    0%
Agreement Error              6    8%       9   10%      15    9%
Syntactic errors (Σ)        11   15%      13   15%      24   15%
Telegraphic Ellipsis         7   10%      18   20%      25   16%
Contextual Ellipsis         11   15%       9   10%      20   12%
Ellipses (Σ)                18   25%      27   30%      45   28%
Total                       72  100%      89  100%     161  100%

Table 3.2: The distribution of errors and ellipses in CARS

The relatively small size of the corpus makes the error frequencies sensitive
to individual subjects' strange behavior. Ten of the 14 run-ons in MACHINE,
for example, are the work of two particularly careless subjects. One of these 
utterances reads: 

==> I want to see cars inprice-range 20000 to 70000 of
    makesaudi , bmw,ford,mazda, toyotapeugeotvolkswagen (18)

The fluctuations between MACHINE and OPERATOR are probably due more to
the choice of subjects for the respective experimental setups than to the setups 
themselves. 

Utterance (18) contains three run-ons, the first two being single errors and 
the last a multiple error. Note that the absence of spacing in the comma- 
separated enumeration of the makes 'bmw', 'ford' and 'mazda' does not con- 
stitute an error. Although the enumeration violates typographic conventions 
regarding spacing in conjunction with punctuation characters, a tokenizer can 
resolve this problem by simply triggering on characters like comma. 
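
A tokenizer that triggers on the comma can be sketched in a few lines. The regular expression and the function name are assumptions made for this illustration; the tokenization actually used in this thesis is discussed in later chapters.

```python
import re


def tokenize(utterance):
    """Split on white space and treat every comma as a token of its own."""
    return re.findall(r"[^\s,]+|,", utterance)


tokens = tokenize("of makesaudi,bmw,ford,mazda")
print(tokens)
# Dropping the commas leaves the word tokens of the enumeration.
print([t for t in tokens if t != ","])
```

The enumeration is separated correctly without any error recovery, whereas the run-on 'makesaudi' survives as a single token and still has to be handled by the lexical error recovery module.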

A system with no robustness built into it, and without the ability to deal 
with elliptic constructions, could theoretically parse 248 (67%) of the utterances 
in CARS (from Table 3.1). It is interesting to see how the theoretic performance 
of the system improves when robustness functionality is added. 

Envisage three robust modules: an automatic spelling and segmentation 
error correction module, a module that interprets telegraphic and contextual 
ellipses and a module that can recover from the basic syntactic errors of missing 
constituent, extra constituent, substituted constituent, transposed constituent 
and also the agreement errors. The figures in Table 3.3 show the theoretical
performance improvements brought on by these modules had they been used on 
CARS. A small number of utterances are hard to interpret for reasons other than 
those discussed here. One utterance contains a definite reference to objects not 
displayed or mentioned before, and there are three occurrences of prematurely 
ended inputs. The single word utterance (19) is an example of the latter case. 



==> show (19)



All utterances are included in Table 3.3 although these oddities are disre- 
garded. 



Robust Module    CARS
None                  248     67%
Lexical           +58=306     83%
Syntactic         +14=262     71%
Ellipses          +36=284     77%
All Modules      +121=369    100%

Table 3.3: Utterances theoretically parsable with different robustness capacities



Note that the figures in Table 3.3 require the modules to have both 100%
recall and 100% precision. The figures do not add up because there are
utterances that contain instances of different error types, and hence can not be
completely corrected by any single module in isolation.

Table 3.3 strongly emphasizes the need for robustness in this type of applica- 
tion. The non-robust system has a theoretic parsability maximum of 67%, and 
if the system can resolve elliptic constructions it is 77%. The single, potentially 
most productive module is the lexical error recovery module. It is noteworthy 
that a system with the ability to handle ellipses and lexical errors has its max- 
imum at 94% ((248 + 58 + 36 + 6)/369). (There are six utterances that contain 
both lexical errors and ellipses but no syntactic error.) The lexical errors are 
therefore the most interesting error category, at least from the point of view of 
building a useful dialogue system. 
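
The percentages quoted from Table 3.3 can be re-derived with a few lines of arithmetic. The counts are taken from the table and the text above; the variable and function names are assumptions made for this illustration.

```python
TOTAL_UTTERANCES = 369
WELL_FORMED = 248
# Utterances that become parsable when a single module is added (Table 3.3).
GAIN = {"lexical": 58, "syntactic": 14, "ellipses": 36}
# Utterances with both lexical errors and ellipses but no syntactic error.
LEXICAL_AND_ELLIPSIS = 6


def parsable_share(*extra_counts):
    """Share (in whole percent) of utterances that are theoretically parsable."""
    return round(100 * (WELL_FORMED + sum(extra_counts)) / TOTAL_UTTERANCES)


print("no robustness:     ", parsable_share(), "%")
for module, gain in GAIN.items():
    print(f"{module:19s}", parsable_share(gain), "%")
print("lexical + ellipses:",
      parsable_share(GAIN["lexical"], GAIN["ellipses"], LEXICAL_AND_ELLIPSIS), "%")
```

The last line reproduces the 94% ceiling: the six mixed utterances are only recovered when the lexical and the ellipsis modules are both present.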

The lexical error rate, the number of error tokens per token, in CARS is 
2.9%. The lexical errors can be orthogonally divided into non/real words and 
single/multiple errors. Table 3.4 shows how the lexical errors are distributed 
over these categories. Misspellings are naturally divided into the four cate- 
gories since a misspelled word is always realized as a single token regardless 
whether it is a single, multiple, nonword or real-word error. Segmentation er- 
rors are not as straightforward (cf. Section 2.2). The ratio of single error 
misspellings among the nonwords (57%) is considerably lower than the 80% 
reported by Damerau [1964]. The real-word error frequency (7%) could be
on the lower side and it is certainly below the frequency reported by
Mitton [1987], but then again, Mitton's findings may not be representative for
an application such as the present one. The real-word error splits are of two
kinds: strange use of slash ('petrol/mile' → 'petrol␣/␣mile', the multiple
error in the table, and 'dollars/year' → 'dollars/␣year'), and split
number notation ('40000' → '40␣000'). Since 'petrol', 'mile', 'year' and '40'
are all in the vocabulary, these are all real-word errors. This of course raises
the question of what words to put in the vocabulary. Special characters like
'/', '(', ')' are troublesome because they do not belong in the vocabulary
and yet they carry meaning. Numbers should definitely not be included in the
vocabulary. These issues are further discussed in Section 5.4.







                           Nonword error    Real-word error    Total
Missp.    Single error       49    57%         1    17%        50    54%
          Multiple error     12    14%         0     0%        12    13%
Run-ons   Single error       14    16%         0     0%        14    15%
          Multiple error      3     3%         0     0%         3     3%
Splits    Single error        8     9%         4    67%        12    13%
          Multiple error      0     0%         1    17%         1     1%
Total     Single error       71    83%         5    83%        76    83%
          Multiple error     15    17%         1    17%        16    17%
          Total              86   100%         6   100%        92   100%

Table 3.4: Breakdown of lexical errors in CARS

Looking at Table 3.4, the 'easy' errors are the 49 misspellings that are non- 
words and singletons. Although it is the single largest class of errors, it only 
accounts for slightly more than half of all the lexical errors. Error profiles taking 
segmentation errors into account are hard to find and so there is not much to 
compare with, but it seems that the 33% (30/92) segmentation error rate is high 
compared to what others have found. Kukich [1992a] reports that 15% of the 
lexical errors in her TND corpus are segmentation errors. 

3.2 TRAVEL 



The TRAVEL corpus contains 20 dialogues gathered from 20 subjects who were
not aware of the role played by the operator. TRAVEL consists of 717
utterances, 3,882 word tokens and 941 word types. There are 424 (59%) well-formed
utterances, 184 (26%) elliptic and 109 (15%) ill-formed utterances. There is a
larger portion of ellipses in TRAVEL than in CARS, particularly
telegraphic ellipses. The explanation for this probably lies in the different
structures of the two domains and the fact that the TRAVEL domain is
supplied with maps. The TRAVEL domain is more hierarchically structured than
the CARS domain. There are islands in the archipelago, resorts on the islands,
hotels in the resorts and so on. The maps also naturally focus the dialogue, and
let the subject express herself in a telegraphic manner, without the utterance
being hard to interpret for the operator. If the subject has a map of a village on
Rhodes on the screen with the hotels marked out, an utterance like (20) seems 
to come naturally. 



==> standard these hotels (20)



The way that different modalities are exploited in a dialogue system certainly 
has an effect on the dialogue itself and on the occurrences of, for example, 
ellipses, but that is not our prime interest here. 



Error Types                TRAVEL
Misspellings                81    24%
Run-ons                     21     6%
Splits                      14     4%
Lexical errors (Σ)         116    34%
Missing Constituent          6     2%
Extra Constituent            0     0%
Substituted Constituent     12     4%
Transposed Constituent       0     0%
Agreement Error              9     3%
Syntactic errors (Σ)        27     8%
Telegraphic Ellipsis       159    47%
Contextual Ellipsis         36    11%
Ellipses (Σ)               195    58%
Total                      338   100%

Table 3.5: The distribution of errors and ellipses in TRAVEL

Apart from ellipses CARS and TRAVEL are very homogeneous. The distribu- 
tion of the lexical errors for example among the error types is almost exactly 
the same for the two corpora. 

The lexical error rate in TRAVEL (3%) is only slightly higher than that of
CARS. Table 3.6 shows that again the nonword single error ratio (64/103 =
62%) is far below the 'agreed upon' 80% reported by Damerau and others.
The segmentation error ratio is still high; 30% is only slightly down from the
32% found in CARS. The real-word errors, of which there are only splits, are
almost exclusively errors due to lack of linguistic competence. 'boat-trips' →
'boat␣trips' and 'shark-attack' → 'shark␣attack' are examples of these
cases. These errors do not translate very well into English; there is, however, 
one error in the corpus that more clearly demonstrates the nature of the error 
in the original language: 



==> the water drink able (21)
                           Nonword error    Real-word error    Total
Missp.    Single error       64    62%         0     0%        64    55%
          Multiple error     17    17%         0     0%        17    15%
Run-ons   Single error       16    16%         0     0%        16    14%
          Multiple error      5     5%         0     0%         5     4%
Splits    Single error        1     1%        11    85%        12    10%
          Multiple error      0     0%         2    15%         2     2%
Total     Single error       81    79%        11    85%        92    79%
          Multiple error     22    21%         2    15%        24    21%
          Total             103   100%        13   100%       116   100%

Table 3.6: Breakdown of lexical errors in TRAVEL



Besides the real-word error split, utterance (21) is also a telegraphic ellipsis.



3.3 Conclusions 



The motivation for adding robustness functionality to an NL dialogue system 
is to increase the number of utterances that the system can accurately analyze, 
and for this purpose 

• Lexical errors are more urgent than other errors. 

The results in Table 3.3 show that this is true for CARS. Having to choose one 
of the robust modules, the system would benefit the most from the lexical error 
recovery module. This is also the case in the TRAVEL domain as far as lexical and
syntactic errors are concerned, although there are a greater number of elliptic
expressions compared to CARS, especially telegraphic ellipses. Lexical errors
can, and should, be automatically corrected in a dialogue system. As regards
syntactic errors this is not as obvious, and for ellipses this is probably not a very 
good idea at all. Telegraphic ellipsis utterances, the largest ellipsis category, 
contain all of the semantic information that is needed to interpret them (cf. 
utterance (15)). The fact that these utterances are often syntactically erroneous 
should not be a major consideration in trying to find the proper interpretation of 
the utterance. It is probably better to circumvent the syntactic constraints than 
attempting to exploit them when this sort of phenomenon occurs. The same 
sort of reasoning can be applied to some of the syntactic errors. The lexical 
errors, however, can not be circumvented in any way. Out of the 91 lexical 
errors in CARS 82 (90%) are content words and 9 (10%) are function words; the 
numbers for TRAVEL are 103 (89%) and 13 (11%) out of a total of 116 lexical
errors. A large portion of the content words are so-called domain words, i.e.
words with a strong domain association. Fifty-nine percent of the content words
in CARS are also domain words. In TRAVEL the domain word ratio is as high as
73%. There is a high degree of unfamiliar words in TRAVEL, (Greek) names of
resorts and such, which can explain the differences. It is difficult to produce a
meaningful interpretation of an utterance containing a lexical error without the 
use of lexical error recovery methods, and if the error involves a content word, 
or even worse a domain word, it is generally speaking impossible. 
The corpus investigation clearly shows that 

• The lexical error situation here is more severe than normal. 

The dialogue scenario puts a higher cognitive load on the subject, compared to 
many other applications. The subject is concerned with extraction of informa- 
tion, not orthography. This, together with the fact that the operator is forgiving 
with respect to errors, (the subject notices that she can get away with lexical 
errors and becomes less careful,) can explain the frequent errors which are also 
hard to correct. In Kukich's [1992a] application, which is quite similar to the 
present one, the error rate is as high as 5-6%. The other applications cited in 
the previous chapter all have lower error rates than the 3% found in CARS and
TRAVEL, with Pollock and Zamora [1983] reporting the lowest error rate, 0.2%.
Taking the real-word errors into account as well, the error rate is close to 3%.
Apart from the high error rate there is also a relatively small portion of 'easy 
cases' in the dialogue corpora compared to what others have found. The least 
difficult lexical errors are the nonword single error misspellings, which have been 
found by several researchers to make up 80% of the nonword errors. The figures 
here are 57% (CARS) and 62% (TRAVEL). The remainder of the lexical errors
are consequently distributed over the harder cases and it is evident that 

• Wide error scope is crucial in a dialogue application. 

The consequence of not addressing the segmentation errors, for example, will be 
that these are treated as misspellings, which is a foolproof way of providing a 
lexical error recovery module with poor performance. Researchers generally do 
not pay much attention to segmentation errors, though Kukich [1992a] reports 
that run-ons and splits make up 15% of all the lexical errors in her corpus. 
The ratio is considerably higher here, 32% of the lexical errors in CARS being 
segmentation errors and the corresponding figure for TRAVEL is 30%. The hard
cases, the segmentation errors, the multiple errors and the real-word errors 
emphasize the fact that 

• Contextual dependencies are crucial. 

There are generally more alternatives to be considered where segmentation er- 
rors and multiple errors are concerned. Local contextual preferences may then 
be used to distinguish the good hypotheses from the bad ones. A lexical re- 
covery module is needed that can handle both misspellings and segmentation 
errors and has the ability to deal with some of the problems in the smaller but 
harder class of real-word errors.

Many automatic spelling error correction techniques have been developed 
to address the nonword misspellings. Kukich [1992b] performed a comparative 
study of some of the most well-known isolated word error correction algorithms. 
The test set contained 170 human-generated misspellings, of which 25% were
multiple error misspellings. The best algorithms scored around 80% correction 
accuracy. Applied to the CARS corpus (see Table 3.4, Section 3.1), this means 
that the best isolated word error corrector can correct slightly more than 50% 
of the lexical errors in the corpus (0.8 × (49 + 12)/91 ≈ 0.54). This limit needs to
be raised. 
Chapter 4

Background 



The research area of automatic spelling error detection and correction is nearly 
as old as the computer. Kukich [1992b] provides an excellent review of the 
varied and wide-ranging area spanning the last few decades. 

The (incomplete) survey of the research field presented in this chapter will 
review some of the more influential and/or interesting contributions. Special 
interest is devoted to the error scope of the proposed technique, whether or not 
contextual information is used and, of course, the performance of the algorithms 
(to the extent that authors quantify their results). 

The review below is divided into three sections roughly corresponding to 
three general 'schools-of-thought' in automatic spelling correction. Section 4.1 
"The Classical Method" describes the string edit distance idea and techniques 
akin to it. Section 4.2 "Noisy Channel Methods" reviews the probabilistic ap- 
proach and Section 4.3 "Error Correction in Rule-Based NLP Systems" surveys 
some of the attempts at robust natural language processing where spelling cor- 
rection is generally just one of several problem areas that are addressed. 



4.1 The Classical Method 

One of the earliest and probably the most influential contributions to the area 
of human-generated spelling error correction techniques was that of Damerau
in 1964. Damerau [1964] found that approximately 80% of all nonword mis- 
spellings were also single error misspellings. Based on these findings Damerau 
subsequently implemented what was later to be named the minimum edit dis- 
tance algorithm. The algorithm detects errors by comparing the input word to 
a dictionary. When an error is detected, the program tries to transform the 
input word into a legal word in the dictionary relying on the assumption that 
exactly one of the basic error operators has 'produced' the error. When a match 
is found, the process is halted and the dictionary word is suggested as correction 
for the input word. 
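
The single-error assumption can be sketched as follows. The code undoes each of the four basic error operations once and keeps the results that are dictionary words. It is an illustrative reconstruction under stated assumptions (a lowercase ASCII alphabet, hypothetical names, and a return value listing all matches rather than halting at the first one); it is not Damerau's original implementation.

```python
import string


def single_error_corrections(word, dictionary, alphabet=string.ascii_lowercase):
    """Dictionary words reachable from `word` by undoing exactly one error."""
    if word in dictionary:
        return [word]                  # no error detected
    candidates = set()
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        if right:
            candidates.add(left + right[1:])                 # undo an extra character
            for c in alphabet:
                candidates.add(left + c + right[1:])         # undo a substitution
        if len(right) > 1:
            candidates.add(left + right[1] + right[0] + right[2:])  # undo a transposition
        for c in alphabet:
            candidates.add(left + c + right)                 # undo a missing character
    return sorted(candidates & set(dictionary))


dictionary = {"price", "range", "model"}
for token in ["prce", "pirce", "pricee", "prize", "xyz"]:
    print(token, "->", single_error_corrections(token, dictionary))
```

A token with two errors, or a run-on, produces no candidates at all, which is the limitation discussed throughout this chapter.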

Independently of Damerau, Levenshtein [1966] developed a similar technique
in the research discipline of error correcting binary codes. The Levenshtein
Distance (LD) is the distance between two words in terms of deletions, insertions
and reversals (in the binary-code setting a reversal flips a bit, i.e. it is a
substitution). The idea of the LD metric algorithm is to choose
the word in the dictionary that is closest to the erroneous input word. The LD 
metric algorithm is sometimes used synonymously with the minimum edit distance
algorithm, or rather, the term minimum edit distance algorithm refers to both 
ideas nowadays. 
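
A minimal sketch of the nearest-word idea is given below. It assumes the formulation that is common today, with unit costs for insertion, deletion and substitution; the function names and the tie-breaking rule (alphabetical order) are assumptions made for this illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # insert cb
                               previous[j - 1] + (ca != cb)))  # substitute or match
        previous = current
    return previous[-1]


def nearest_word(token, dictionary):
    """Dictionary word closest to `token`; ties are broken alphabetically."""
    return min(sorted(dictionary), key=lambda word: levenshtein(token, word))


print(levenshtein("kitten", "sitting"))
print(nearest_word("consumtion", {"consumption", "construction", "price"}))
```

Note that every input token has to be compared with every dictionary word, which is the M-fold cost referred to in Section 2.2.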

Several authors have extended the LD metric algorithm. Okuda et al. [1976] 
introduced the Weighted Levenshtein Distance (WLD). This is a generalization 
of the LD algorithm that can correct garbled words containing multiple instances 
of character substitutions, insertions and deletions. The weights can be used 
to give preference (shorter distance) to one error type over the others. The 
authors were satisfied with their algorithm, stating that: "our method achieved 
higher error correcting rates than any other method tried to date". Okuda et 
al., however, noticed that short words constitute a serious problem. 
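
The effect of the weights can be illustrated by generalizing the unit-cost recurrence. The sketch below is an assumption-laden illustration: the parameter names and the example weights are hypothetical and are not the values used by Okuda et al.

```python
def weighted_levenshtein(garbled, word, w_sub=1.0, w_ins=1.0, w_del=1.0):
    """Cost of turning `garbled` into `word` with per-operation weights."""
    previous = [j * w_ins for j in range(len(word) + 1)]
    for i, cg in enumerate(garbled, start=1):
        current = [i * w_del]
        for j, cw in enumerate(word, start=1):
            current.append(min(
                previous[j] + w_del,                              # drop cg
                current[j - 1] + w_ins,                           # add cw
                previous[j - 1] + (0.0 if cg == cw else w_sub)))  # replace or match
        previous = current
    return previous[-1]


print(weighted_levenshtein("cat", "cut"))              # one substitution
print(weighted_levenshtein("cat", "cut", w_sub=3.0))   # cheaper as delete + insert
print(weighted_levenshtein("cat", "cart", w_ins=0.5))  # a cheap insertion
```

Lowering the weight of one operation makes corrections of that type rank higher, which is the preference mechanism described above.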

The character n-gram technique is another class of methods used for spelling 
correction of various kinds. Angell et al. [1983] implemented a technique that 
computes a similarity measure between an input word and the words in the 
dictionary based on trigrams. The similarity is computed as the fraction of 
trigrams that are common to the two strings. 
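
A minimal sketch of such a measure is given below. It assumes that the "fraction of common trigrams" is normalized as a Dice coefficient over the two trigram sets and that words are not padded with boundary symbols; both choices are assumptions made for this example.

```python
def trigrams(word):
    """Set of all substrings of length three."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def trigram_similarity(a, b):
    """Dice coefficient over the trigram sets of a and b."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0          # neither word is long enough to have a trigram
    return 2 * len(ta & tb) / (len(ta) + len(tb))
```

Here `trigram_similarity("spelling", "speling")` is 8/11, while `trigram_similarity("form", "from")` is 0: a single transposition in a four-letter word leaves no trigram intact.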

The method achieved an overall accuracy of 76% on a test set of 1,544
misspellings using a dictionary of 64,636 words. The authors noticed that the
correction rate for transpositions was very poor (36%); it was actually worse
than that for multiple errors (55%). They also noticed that transpositions
occurred more often in shorter words on average, and short words were also
problematic for the trigram similarity technique.

The SPEEDCOP system [Pollock and Zamora, 1984] is one of the techniques
devised to correct single error misspellings by way of similarity keys. Each word
in the vocabulary is given a key and when a misspelling is detected, its key is
also computed and compared to the set of precompiled keys corresponding to the
words in the dictionary. The words with identical or similar keys are considered
as correction alternatives. The similarity keys can be computed in a number of
ways; in this case the ordering of the consonants is emphasized [Pollock and
Zamora, 1984]. The authors tested their program on over 50,000 misspellings
gathered from seven different scientific databases with a 40,000 word dictionary.
SPEEDCOP scored an overall correction rate of 74-88% for the
different databases, counting only the misspellings whose corresponding word
was in the dictionary. SPEEDCOP uses some complementary correction aids
besides the similarity keys (it actually uses two sets of similarity keys). One
such complementary aid is the so-called "function word routine" which looks for
concatenations (run-ons) of frequent function words (e.g. 'oftheir', 'inthe').
It is not completely clear exactly how the function word routine operates, but
it is claimed that it improves the overall correction rate by 1-2%.
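
To make the similarity key idea concrete, the sketch below builds a key from the first letter of a word, followed by the remaining distinct consonants in order of occurrence, followed by the distinct vowels in order of occurrence. This construction is an assumption modeled on how SPEEDCOP's consonant-oriented key is usually described, not a verified reimplementation; the function name and the example words are likewise chosen for illustration.

```python
VOWELS = set("aeiou")


def skeleton_key(word):
    """First letter, then unseen consonants in order, then unseen vowels."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    seen = {word[0]}
    for ch in word[1:]:                       # consonants first
        if ch.isalpha() and ch not in VOWELS and ch not in seen:
            key.append(ch)
            seen.add(ch)
    for ch in word[1:]:                       # vowels last
        if ch in VOWELS and ch not in seen:
            key.append(ch)
            seen.add(ch)
    return "".join(key).upper()
```

Under this definition the word `chemical` and the misspelling `chemcial` receive the same key, so a lookup by key retrieves the intended word directly.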

What the "classical" techniques above have in common is that they look 
at words in isolation, which puts real-word errors outside the scope of these 
techniques. Furthermore, based on the findings of Damerau, most of the isolated
word error correction programs only consider singletons. Segmentation errors
are not addressed at all, with SPEEDCOP, which addresses a subset of the run- 
ons, as the outstanding exception. The striking observation is that segmentation 
errors are usually not even mentioned, and this goes for the entire area, not just 
the classical methods. It is often unclear whether or not errors due to defective 
tokenization are included in the test sets. 

There seems to be a limit of somewhere around 80% correction rate for 
the isolated word correction algorithms (cf. Kukich [1992b]). One of the major 
problems with the classical methods is that the ranking of alternative correction 
candidates is fairly imprecise, i.e. there will generally be a number of correction 
alternatives that are equally 'good' and there is no way to decide which should 
be preferred. A more fine-grained (although not necessarily better) measure of 
'goodness' is provided via probabilities. 



4.2 Noisy Channel Methods 

The noisy channel idea is based on the metaphor that communication is relayed
via an imperfect medium, the channel. A word is inserted in one end of
the channel and from the other end comes the distorted version of that word.
Given the distorted word and the characteristics of the channel, the job is to
calculate the most likely word to have been inserted originally. This likelihood
is estimated via conditional probabilities. If there is also a language model that
has information on what words are likely to be inserted, it provides what are
known as the prior probabilities. (See Chapter 5 for more details.)
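
Written out, the decision rule is an application of Bayes' rule. In the block below, $w$ stands for the word inserted into the channel, $o$ for the distorted word that is observed, and $\hat{w}$ for the selected correction; these symbols are chosen here for illustration only.

```latex
\hat{w} = \arg\max_{w} P(w \mid o)
        = \arg\max_{w} \frac{P(o \mid w)\, P(w)}{P(o)}
        = \arg\max_{w} P(o \mid w)\, P(w)
```

The denominator $P(o)$ is the same for every candidate $w$ and can therefore be dropped, leaving the product of the channel probability $P(o \mid w)$ and the prior probability $P(w)$.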

The noisy channel model has a long tradition in the neighboring research 
areas of optical character recognition (OCR) and speech recognition. Especially 
in OCR the approach seems natural since the channel (OCR-device) is there, 
open for inspection. In automatic correction of human-generated spelling errors 
it is not as straightforward since the channel characteristics are more elusive. 
However, with the increasing availability of large corpora, the noisy channel 
model has found its way into computational linguistics in recent years. 

Researchers who look to the noisy channel model for automatic spelling error 
correction can roughly be divided into two groups: those who emphasize the 
prior probabilities with the intention of correcting real-word errors, and those 
who emphasize the channel's distribution with nonword errors as prime target. 

Kernighan, Church and Gale (KCG) were among the first to use really large
text corpora to estimate the channel characteristics for the purpose of spelling
correction. KCG [1990, 1991b, 1990] constructed a program, CORRECT, that
generates and ranks corrections for words rejected by spell¹. The program
generates correction candidates for a typo by applying a single instance of one
of the basic error operators to the typo, and then it ranks the candidates using
a Bayesian scoring function. The spelling corrector can thus handle single
error nonword misspellings. The scoring function is based on the modeling of
the typing process as a noisy channel. The most likely correction for a typo
$t$ is the correction candidate $c$ that maximizes $P(c)P(t \mid c)$. The probabilities
are estimated from the 1988 AP newswire corpus ($44 \cdot 10^6$ words). The conditional
probability $P(t \mid c)$ is computed from four confusion matrices that contain
the number of deletions, insertions, substitutions and transpositions that were
recorded for the individual characters in the words rejected by spell.

¹ The Unix spell program [McIlroy, 1982] is a fast wide-covering spelling error detection
program that uses elaborate hash functions to store the vocabulary for random access. See
also Domeij et al. [1995].
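
The generate-and-rank scheme can be sketched as follows. The code is not the CORRECT program: the function names, the word counts and the uniform channel model are all made up for this example, and a realistic channel model would instead be estimated from confusion matrices.

```python
import string


def single_edit_candidates(typo, vocabulary):
    """Vocabulary words reachable from typo by reversing one insertion,
    deletion, substitution or transposition of adjacent characters."""
    letters = string.ascii_lowercase
    candidates = set()
    for i in range(len(typo) + 1):
        left, right = typo[:i], typo[i:]
        for ch in letters:
            candidates.add(left + ch + right)            # typist dropped ch
        if right:
            candidates.add(left + right[1:])             # typist added right[0]
            for ch in letters:
                candidates.add(left + ch + right[1:])    # typist replaced ch
        if len(right) > 1:
            candidates.add(left + right[1] + right[0] + right[2:])  # swapped pair
    return sorted(w for w in candidates if w in vocabulary and w != typo)


def rank_candidates(typo, vocabulary, prior, channel):
    """Order the candidates by decreasing P(c) * P(t | c)."""
    scored = [(prior(c) * channel(typo, c), c)
              for c in single_edit_candidates(typo, vocabulary)]
    return [c for _, c in sorted(scored, key=lambda sc: (-sc[0], sc[1]))]


# Made-up counts standing in for corpus frequencies (illustration only).
TOY_COUNTS = {"across": 80, "actress": 30, "access": 20,
              "acres": 10, "caress": 2, "cress": 1}
TOTAL = sum(TOY_COUNTS.values())


def toy_prior(word):
    return TOY_COUNTS[word] / TOTAL


def uniform_channel(typo, candidate):
    # Placeholder: every single-error confusion is treated as equally likely.
    return 1.0


ranking = rank_candidates("acress", set(TOY_COUNTS), toy_prior, uniform_channel)
```

With a uniform channel the ranking simply follows the prior, so `across` comes first for the toy counts above; replacing `uniform_channel` with estimates derived from observed typing errors is what lets the channel probability reorder the candidates.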

CORRECT's scoring function was tested on 329 misspellings where the program
generated two correction candidates. Three human judges were asked to
choose among the alternatives, given the rejected word, the two alternatives and 
a few concordance lines of context. CORRECT agreed with the majority of the 
judges in 87% of the cases. 

During the work on the CORRECT program it was noticed that human judges
were reluctant to decide between alternative candidate corrections given only
the information available to the program, i.e. the typo and the candidate corrections.
The judges were much happier if they could see a line or two of the
typo's surrounding context. Consequently KCG went on to furnish CORRECT
with a bigram model of local context. The Bayesian scoring function of a correction
candidate $c$ given a typo $t$ used in the original CORRECT, $P(c)P(t \mid c)$,
the prior probability and the channel probability, is complemented with the
probability that the word to the left of $c$ is $l$, $P(l \mid c)$, and the probability that
the word to the right is $r$, $P(r \mid c)$. The intention of KCG was to see whether
the updated scoring function $P(c)P(t \mid c)P(l \mid c)P(r \mid c)$ would enhance the
performance of CORRECT. The test set of the 329 nonword errors rejected by
spell for which CORRECT generated exactly two correction candidates was
again used for testing. Using only the prior and channel probabilities, CORRECT
agreed with the judges in 87% of the cases. Using the bigram probabilities,
performance rose to almost 90%, which the authors found to be significant.
They also discovered that the bigram parameter estimator was crucial
to the behavior of the program. Compared to the original CORRECT, maximum
likelihood estimation actually degraded performance and expected likelihood
estimation made no difference. The only estimation technique to
improve performance was the Good-Turing estimation technique [Good, 1953,
Church and Gale, 1991a].

Kashyap and Oommen [1981, 1984] had investigated the same approach some 
years earlier although they had used subjective estimates of the parameters 
describing the channel. Their results range from 30% to 92% correction accuracy 
depending on the word length and number of errors per word. They also report 
that the figures compare favorably with those reported by Okuda et al. [1976] 
which ranged from 28% to 64% for words with similar characteristics. 

Mays et al. [1991] took the other route, the one that focuses on the correction 
of real-word errors. The prior distribution is modeled with a trigram language 
model:

$$P(W) = \prod_{i} P(w_i \mid w_{i-2}, w_{i-1})$$



The trigram language model has relatively good predictive power but the
channel model used by Mays et al. is quite simple. Each word $w_i$ in the vocabulary
has a precompiled confusion set $w_i^c$. The confusion set is generated by
applying the four basic error operators to each character position of the word
exactly once and adding the resulting string to the confusion set if it is a legal
word in the vocabulary. $w_i$ is also added to $w_i^c$. The confusion set of a word thus
contains all the single error misspellings that are legal words in the vocabulary.
$P(y_i \mid w_i)$, where $y_i$ is the real-word error, is then simply computed as:

$$P(y_i \mid w_i) = \begin{cases} \alpha & \text{if } y_i = w_i \\ (1 - \alpha)/(|w_i^c| - 1) & \text{otherwise} \end{cases}$$

where $\alpha$ is a constant determined by experimentation ($\alpha$ = .9, .99, .999, .9999 in
the reported experiments). Note that all correction candidates for $y_i$ in $w_i^c$ have
equal probabilities, except for $w_i$. The technique was tested on 8,628 sentences
using a 20,000 word vocabulary. The test sentences were generated from 50
sentences of AP newswire and 50 sentences from the Canadian Parliament transcripts.
These 100 original sentences were manipulated so that each of the 8,628
test sentences contained exactly one single error real-word error. The technique
proposed by Mays et al. detected and corrected 73% of the 8,628 single error
real-word errors. It is safe to say that it is the trigram model that does the
work; the channel model merely enumerates the candidates where the trigram
scores low. Mays et al. [1991] used the trigram language model employed in the
IBM speech recognition project [Bahl et al., 1983].
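
The channel model just described is simple enough to state in a few lines. In the sketch below the function name, the default value of `alpha` and the hand-written confusion set are assumptions made for illustration; in the approach described above the confusion sets are precompiled from the vocabulary.

```python
def channel_probability(observed, intended, confusion_sets, alpha=0.99):
    """P(observed | intended): alpha when the word was typed correctly,
    the remaining mass shared equally among the other confusion set members."""
    cset = confusion_sets[intended]        # assumed to contain `intended` itself
    if observed == intended:
        return alpha
    if observed in cset and len(cset) > 1:
        return (1.0 - alpha) / (len(cset) - 1)
    return 0.0                             # not a single-error real word


# Hand-written confusion set, for illustration only.
EXAMPLE_SETS = {"piece": {"piece", "niece", "pierce"}}
```

For the example set, `channel_probability("piece", "piece", EXAMPLE_SETS)` is 0.99 and each of the two other members receives approximately 0.005, so the probabilities over the confusion set sum to one.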

Golding and Schabes [1996, 1995] took the idea of Mays et al. in a slightly 
different direction. They also used confusion sets but these were not based 
on the basic error operations. Rather, the confusion sets were selected on the 
basis of the fact that they occurred frequently in the Brown Corpus [Kucera 
and Francis, 1967] and sampled from the list of "Words Commonly Confused" 
in Random House [Flexner, 1983]. The confusions represent different types 
of errors: homophone confusions {'peace', 'piece'}, grammatical confusions
{'among', 'between'} and the authors also added some typographical confusions
that were not found in Random House, e.g. {'being', 'begin'}. Golding
and Schabes contrasted three different language models: a part-of-speech (POS) 
trigram language model, a feature-based Bayesian scoring function and a hybrid 
of the two. The POS trigram worked better than the Bayesian scoring function 
and the hybrid outperformed both. The technique was tested on 20% of the 
Brown Corpus. The evaluation is rather small-scale in that only 18 confusion
sets are used. The confusion sets are also quite small; they contain only two
or three words each. Two tests were conducted, one to see whether the system
would wrongfully change the correct usage of a word in a confusion set into the
incorrect usage, and one to see if the system could restore the correct usage of
the word if presented with an error. The performance varies for the 18 different 
confusion sets, results ranging from 35.3% to 98.4% on the corrupted words, 
and from 87.8% to 100% on the uncorrupted words. 

Atwell and Elliott [1987] took CLAWS [Garside, 1987] as a starting point to
detect real-word errors. CLAWS is a program that assigns POS tags to a text
using a POS bigram language model. The idea is simply to assign POS tags to the
text of interest using CLAWS, and when the probabilities of two consecutive
tag-pairs fall below a predefined threshold, the word whose tag is present in both
tag-pairs is marked as an error. The authors collected a 13,500 word corpus
and extracted 502 real-word errors. The real-word error detector scored a 62% 
recall and 35% precision. The way that the threshold is set is obviously crucial 
to the performance of the method. In an attempt to better the poor precision 
rating Atwell and Elliott raised the threshold slightly and precision rose from 
35% to 38%, but in doing so recall dropped from 62% to 47%. These results 
indicate that the discriminative powers of the POS bigram probabilities alone 
are too weak. This is particularly evident when an absolute value (threshold) 
is used for making the decisions. Comparing alternative tag-pair sequences to 
each other is surely a more promising approach. 
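
The thresholding scheme can be sketched as below. The function name, the tag names and the probabilities are hypothetical, and the code assumes that the tags have already been assigned by a tagger; it is meant only to show the decision rule, not the CLAWS-based implementation.

```python
def flag_errors(tags, bigram_prob, threshold):
    """Indices of words whose tag takes part in two consecutive
    tag-pairs that both have a probability below the threshold."""
    flagged = []
    for i in range(1, len(tags) - 1):
        left = bigram_prob.get((tags[i - 1], tags[i]), 0.0)
        right = bigram_prob.get((tags[i], tags[i + 1]), 0.0)
        if left < threshold and right < threshold:
            flagged.append(i)
    return flagged


# Hypothetical tag-pair probabilities, for illustration only.
TOY_BIGRAMS = {("DET", "NOUN"): 0.4, ("NOUN", "VERB"): 0.3,
               ("DET", "PREP"): 0.001, ("PREP", "VERB"): 0.002}
```

With these numbers and a threshold of 0.01, the tag sequence DET PREP VERB has its middle word flagged, whereas DET NOUN VERB passes unflagged. Because the decision rests on an absolute cut-off, every flagged position depends entirely on where that cut-off is placed.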



There is obviously a greater interest in the more challenging real-word errors
among the researchers adopting the noisy channel approach. The focus is con- 
sequently on the language model, often at the expense of the channel character- 
istics. The confusion set idea works satisfactorily for a small set of precompiled 
errors introduced in the channel, but is left without a chance to correct an error 
that does not belong to one of the confusion sets. One can, of course, argue 
that one should have a complementary program that performs nonword error 
correction and that the real-word error correction algorithm should be applied 
on the output from this program. The problem, however, is that if the real-word 
error correction module proposes corrections based solely on the preferences of 
the language model, regardless of orthographic similarity between error and 
correction, it will introduce errors that were not originally there. 



The introduction of contextual properties in the spelling correction algo- 
rithms is certainly beneficial. Contextual properties are useful not only for the 
purpose of real-word error correction. KCG use the contextual information to 
be able to better discriminate between competing correction alternatives to nonword
errors. Although there is nothing in principle that prevents one from using
contextual dependencies also in the "classical approaches", it seems as though
they are more naturally incorporated into the noisy channel model. 



None of the authors cited above mentions segmentation errors.

4.3 Error Correction in Rule-Based NLP Systems

This section takes a wider view on errors in natural language text. The aim of 
an NLP-system is to analyze, interpret, translate or whatever the application 
might be, as many input sentences as possible, and for this reason errors other 
than just spelling errors are, of course, of interest. 

The need for error-handling capabilities became evident as real users were 
let loose on the early prototype NLP-systems. As a consequence the early eight- 
ies saw a range of NLP-systems with robustness functionality built into them. 
These systems can be divided into three general categories [Kukich, 1992b]: the
relaxation-based (e.g. [Heidorn et al., 1982, Weischedel and Sondheimer, 1983]),
expectation-based (e.g. [Carbonell and Hayes, 1983, Granger, 1983]) and the
acceptance-based techniques (e.g. [Fass and Wilks, 1983]).

The acceptance-based approach works under the assumption that errors can 
be ignored as long as an interpretation can be found that is meaningful in the 
given application. The approach is grounded in the observation (hypothesis) 
that this is the way humans deal with erroneous or disturbed input. Acceptance- 
based approaches tend to make extensive use of semantic information and little 
use of any other level of linguistic information. 

The expectation-based technique derives its expectations from various linguistic
knowledge sources. CASPAR/DYPAR [Carbonell and Hayes, 1983] use
syntactic and semantic expectations expressed in a case frame whose slots ex- 
pect to be filled. Carberry [1984] generates responses to pragmatically ill-formed 
dialogue input based on the user's goals and plans that have been inferred from 
the preceding dialogue. 

Contrary to the acceptance-based approach the relaxation-based technique 
assumes that no errors can be ignored. This assertion is largely based on the 
fact that many early NLP-systems depended heavily on syntactic rules, and 
when the input violated the rules, the parsing process simply came to a halt. 
The relaxation-based technique will try to find the rule that when relaxed will 
allow the parser to succeed. If this scheme works, it means that the error 
is both localized and diagnosed and can consequently also be corrected. The 
relaxation-based approach is the one that has been favored in the research com- 
munity, compared to the other two approaches. However, there are still prob- 
lems. When only syntactic constraints are applied, there are many rules that 
when relaxed will lead to a complete parse and there will be a large amount 
of correction alternatives. Mellish [1989] noticed this problem trying to correct 
unknown/misspelled words, omitted words and spurious words with relaxation 
techniques using a CFG. Ingels [1992, 1993] tried the same approach, but with 
a richer grammar formalism that allowed also for feature values to be relaxed, 
and experienced the same problem. 

After the hectic years in the early eighties interest in robust NLP seemed 
to have faded somewhat, but only to rise again in the late eighties and early 
nineties. Two important systems from this period are CRITIQUE [Richardson and
Braden-Harder, 1988] and a text-editing system for Dutch [Kempen and Vosse,
1990, Vosse, 1992]. Both systems are relaxation-based but only the latter takes
spelling errors seriously.

The CRITIQUE system [Richardson and Braden-Harder, 1988], with its ancestor
the IBM EPISTLE system [Heidorn et al., 1982], is one of the few systems to
sincerely address robustness issues in a large-scale, i.e. wide-coverage, natural
language text processing system. The dictionary includes more than 100,000
entries and the words also carry information used in syntactic processing. The
grammar contains several hundred phrase structure rules. The CRITIQUE system
accepts text as input and delivers critique on grammar and style at levels
of detail that can be adjusted by the user. The original EPISTLE system could
cope with different types of grammatical errors whereas the rule-based parser in
CRITIQUE "... provides a unique approximate syntactic parse for a large percentage
of English text and diagnoses over 100 grammar and style errors". After
an initial preprocessing phase the robust parsing proceeds along the following
lines: Lexical processing identifies words not in the dictionary and assigns them
default morphological and syntactic information to avoid parsing failure. After
the lexical analysis the text is passed to the parser. If the parser fails to produce
a complete parse, the parser goes over the text once more, this time with some
of its constraints relaxed. Certain lexical substitution rules are also activated
during this second pass. The substitution rules involve easily confused words
such as whose↔who's or its↔it's. If parsing now succeeds, the relaxed rules will
serve as a basis for the critique fed back to the user. If the relaxation phase still
does not result in a complete parse, the 'parse fitting' procedure is
invoked [Jensen et al., 1983]. The parse-fitting procedure relies on heuristics
to choose a head constituent to which fragments produced by the parser in the
relaxation phase can be attached to form an approximate parse. Even when
parse-fitting is applied, grammar and style error critique can be produced for
the incomplete fragments. In any of the processing phases described, multiple
parses may result, and in this case the system selects one based on a parse metric
which favors trees in which modifying words and phrases are attached to the
closest qualifying constituent [Heidorn, 1982].

The accuracy of CRITIQUE was tested on 10 essays from each of four groups:
freshman compositions, business writing, ESL (English as Second Language)
and professional writing. The diagnoses made by CRITIQUE were classified as
correct, useful or wrong, useful meaning that the detection of the error was
satisfactory but that the critique was off the target. Strangely enough, the authors
did not consider errors that the system missed. CRITIQUE produced the correct
advice in 39% (professional), 54% (ESL), 72% (freshman) and 73% (business)
of the critiques. When useful critiques were taken into account, figures rose
in all four categories, particularly for the ESL group (54% → 87%), but as the
authors point out, useful critique may not be that useful to users who lack native
intuitions about English.

Another parser-based text proof-reading system is that developed for Dutch
by Kempen and Vosse [Kempen and Vosse, 1990, Vosse, 1992]. The system can
correct nonword spelling errors and real-word syntactic errors such as agreement
errors. It can also handle word doubling errors and problems with idiomatic
expressions and compounds. Some structural errors such as punctuation errors and
strange word order errors can also be dealt with. The robustness is achieved
mainly in two processing phases, the word-level processing and the sentence-level
processing. In the word-level processing the linear order of the text is
abandoned for a lattice structure. The lattice reflects ambiguities that arise
from compounds, idiomatic phrases and word doublings. If the text contains
spelling errors a correction module is invoked and a limited number of correction
alternatives are added to the lattice. The correction algorithm is based on
a variation of trigram analysis [Angell et al., 1983] and triphone analysis [van
Berkel and de Smedt, 1988] extended with a ranking mechanism. The dictionary
contains 250,000 word-forms. After the word-level processing the sentence-level
processing can proceed (on the lattice). The parser is a shift-reduce parser
working with an Augmented Context-Free Grammar (ACFG) of some 500 rules.
If an agreement constraint is violated during parsing, the constraint is relaxed
and appropriately marked as such, and parsing continues. Structural errors are
dealt with in a different manner. These (unusual according to the authors)
errors are parsed with error rules included in the grammar. When parsing is
finished, the job is to select among the (sometimes very numerous) alternative
parses. The most straightforward method is to simply count the number of
errors in the parses and choose the one with the fewest errors in it.
However, this instrument is too blunt since there may be many parses with the
same number of errors. To make the ranking of alternatives more fine-grained
the grammar rules have weights added to them. For example, verb transitivity
violation is more heavily penalized than incorrect subject-verb agreement. After
the selection phase the text can be regenerated with suggested corrections and
diagnostic messages.

The word level processor was tested on 1,000 lines of text randomly chosen 
from two large texts submitted for publication, one on employment legislation 
and the other concerning collective wage legislation [Vosse, 1992]. The sample 
contained almost 6,000 words with 30 nonword misspellings. Twenty-eight of the
misspellings were detected and 14 were given the proper correction. The two
missed misspellings were assumed by the system to be proper names. Eighteen
false alarms were produced. Compared to an elementary spell-checker (supposedly
the sort of spell-checker that comes with word processors) the word-level
processor performed well. A simple spell-checker, on the same text, marked
217 words as misspelled, which amounts to 187 false alarms: 37 abbreviations
and proper names and 150 compounds. The authors report that the word level
can process in excess of 25 words per second. The sentence-level, however, re- 
quires considerably more time. Processing time ranges from four or more words 
per second for short error-free sentences to several seconds per word for longer 
and more error-prone sentences. On a 150 sentence spelling test for secretaries 
and typists, the system was able to correct 72 out of 75 errors without any 
false alarms. The errors corrected were spelling errors, agreement errors and 
errors in idiomatic expressions. The three errors missed involved semantic
violations. The 150 sentences were processed in under nine minutes. Although
the spelling correction algorithm of van Berkel and de Smedt [1988] works in
an environment rich in linguistic information, it is actually an isolated word
correction technique. That is, the correction alternatives are generated without 
contextual information. The contextual information is supplied afterwards to 
disambiguate among the multiple alternatives. 

The problem of disambiguation has a long history in NLP research. The 
problem is to find the correct word-sense for a word in a particular context where 
the word has more than one interpretation in the vocabulary. This problem is 
very hard and requires large amounts of linguistic knowledge on all levels, and 
in the general case also extra-linguistic knowledge of the world. 

The problem of lexical error recovery can also be seen as a problem of dis- 
ambiguation, but on a different level, the string level. There is certainly no 
conflict of interests here, both problems are important and need to be solved. 
The question is rather: what are the crucial knowledge sources in the respective 
problems? 

There is obviously a fair amount of randomness involved in how lexical er- 
rors are produced. Probabilistic techniques are well suited (although not always 
exploited) to capture the randomness of spelling behavior, which is one of the 
fundamental ideas behind the noisy channel approach. Still, there are clearly
patterns there as well. Van Berkel and de Smedt [1988] focus on phonetic
resemblance, which is one of several distinguishable patterns. Emphasizing one
type of pattern is bound to obscure others. Techniques based on probabilistic 
or statistical methods usually derive their model's parameters from a corpus, 
and in the case of lexical error recovery, the corpus would consist of errors. This 
implies that as long as we have a large enough set of example errors, preferably 
produced by real users, we can train or derive a model that can describe any 
pattern that might be in the corpus. The model would then 'encode' a ran- 
domness/pattern mix that reflects the contents of the corpus. The problem is 
of course that a large enough corpus is quite hard to come by. Probabilistic
techniques also provide a sharp disambiguation instrument: there will always
be one alternative that is better (more likely) than the rest (for better or for
worse). The voluminous NLP-systems reviewed in this section contain a large
set of hand-crafted rules. The proof-reading system of Vosse [1992] uses this 
rule-base (of primarily syntactic constraints) to disambiguate among the mul- 
tiple words and word-senses produced by the isolated word spelling correction 
program. That is, it uses information that really pertains to a different level 
of description. There is nothing wrong with that, it obviously produces useful 
results. The point, however, is that information pertaining to the string disam- 
biguation problem is overlooked. Another problem related to the use of large 
rule-bases for lexical error disambiguation is that it is hard to move to another 
domain or language. The isolated word spelling correction technique is not very 
useful without the syntactic constraints, so a new set of rules must be manually 
constructed before the spelling correction technique can perform satisfactorily. 
This is a laborious task. 

Although algorithms dealing with isolated word error detection and/or correction
work on words in isolation, the program usually processes chunks of
running text, depending on tokenization before the detection phase. The tokenization
process is often simplified so that errors involving word boundaries
will be more or less impossible to correct.

Carter [1992] implemented an elaborate tokenization and error-correcting 
algorithm in the CLARE system which explicitly considers segmentation errors. 
Carter maintains a lattice of overlapping word hypotheses and uses syntactic 
and semantic constraints to select the best alternative. CLARE was tested on 
102 artificially generated sentences containing 108 errors without the use of 
syntactic and semantic constraints. The system found a single correct repair in 
59 cases and 24 of these involved segmentation errors. It is not clear what the 
ratio of segmentation errors was in the source text. 

In spite of the work of Carter and others, Kukich [1992b] states that "the 
general problem of handling errors due to word boundary infractions remains 
one of the significant unsolved problems in spelling correction research" (p. 385). 
This is one of the main problems we deal with in the following chapters.

Chapter 5

An Algorithm for Robust 
Text Recognition 



This chapter includes the theoretical contributions of this thesis. The contri- 
butions are principally condensed into two algorithms, one for isolated word 
error correction (Section 5.3 "Isolated Word Recognition"), and one for the 
correction of lexical errors in general in running text (Section 5.4 "Connected 
Text Recognition"). Sections 5.1 "Fundamentals of Hidden Markov Models" 
and 5.2 "Token Passing" introduce the building blocks and tools with which 
these algorithms are built. 

5.1 Fundamentals of Hidden Markov Models 

This section gives a brief account of the theory behind the discrete observation 
Hidden Markov Model (HMM). We present the components of the model and 
the basic algorithms that can be applied to it. These can of course be found in 
several other places [Rabiner, 1989, Levinson et al., 1983], but the account below 
is slightly different from the 'standard model'. In the theoretical underpinnings 
of Markov models there is no notion of final states as in the case of the FSA, for 
example. In real life, however, observation sequences are finite, at least the ones 
we are interested in. One way of thinking of final states of a Hidden Markov 
Model would be to somehow decide that a subset of the states are legal final 
states and that those of the complementary set are not. This approach can be 
found in, for example, Deller et al. [1993] (p. 690). Another way would be to
assign a probability to each state stating how likely it is that the particular state 
is a final state. This approach certainly blends better with the general theory, 
and more importantly it is trainable as will be shown below. 

The hidden Markov model presented here has two additional states compared
to the standard model. These are called the entry state and the exit state. (The
terms are borrowed from Young et al. [1989]. The terms entry and exit are
preferred over start and final for reasons that will become clear in Section 5.4.)
As mentioned before, the notion of final states in Markov models is not new and
the difference between the standard model and the one presented here is quite
small, although not insignificant. An account of this type of Markov model and
the fundamental algorithms associated with it has to my knowledge not been
published before.

A random process $X$ is a sequence of random variables 

$$X = \{\ldots, X_{t-1}, X_t, X_{t+1}, \ldots\}$$

If the value of $X_t$ is dependent on the value of $X_{t-1}$ but independent of earlier 
variables, i.e. 

$$P(X_t = q_t \mid X_{t-1} = q_{t-1}, X_{t-2} = q_{t-2}, \ldots) = P(X_t = q_t \mid X_{t-1} = q_{t-1})$$

we say that the process is a (first order) Markov process, and when the variables 
take discrete values, a Markov Chain. Given that the range space of the variables 
is finite the Markov Chain can be modeled by a finite state network where 
the states are associated with the outcomes of the random variables. Arcs 
connecting the states of the network impose transition probabilities between 
the states. The transition probabilities are often denoted 

$$a_{ij} = P(X_t = q_j \mid X_{t-1} = q_i)$$

If the transition probability in a state is independent of time t, the Markov 
Chain is said to be homogeneous. 

The Hidden Markov Model (HMM) models two parallel homogeneous ran- 
dom processes where one is the state transition sequence just described and the 
other is a sequence of observation symbols 

$$Y = \{\ldots, Y_{t-1}, Y_t, Y_{t+1}, \ldots\}$$

The variables in $Y$ take values from a discrete set $V$ of observations or 
observables and the observation symbol probabilities are denoted 

$$b_j(k) = P(Y_t = v_k \mid X_t = q_j)$$

where $v_k \in V$. The model is thus extended with an observation symbol 
distribution for each state. The HMM $\mathcal{M}$ is thus determined by: 

• a finite set of states $Q = \{q_1, q_2, \ldots, q_N\}$, where $q_1$ is the non-emitting 
entry state and $q_N$ is the absorbing non-emitting exit state 

• a finite set of observables $V = \{v_1, v_2, \ldots, v_K\}$ 

• an $(N-1) \times (N-1)$ transition matrix $A$ where $a_{ij}$ denotes 
$P(X_t = q_j \mid X_{t-1} = q_i)$ 

• an $(N-2) \times K$ observation matrix $B$ where $b_j(k)$ denotes 
$P(Y_t = v_k \mid X_t = q_j)$ 

There are no transitions out of $q_N$ and no transitions into $q_1$, hence the 
dimensions of the transition matrix $A$. The states $q_1$ and $q_N$ do not emit any 
symbols, which explains the dimensions of the observation matrix $B$. The 
parameters $N$ and $K$, along with a specification of the observation symbol 
alphabet/vocabulary and the two distributions $A$ and $B$, determine the HMM 
$\mathcal{M}$. The shorthand for this is $\mathcal{M} = (A, B)$. 

The HMM above can be used as a generator of an observation sequence 

$$O = o_1, o_2, \ldots, o_T$$

where $o_t$, $1 \le t \le T$, is chosen from $V$. The workings of this abstract 
machine in generative mode are as follows: 

1. Start in the designated entry state $q_1$ at time¹ $t = 0$. 

2. Transit to a new state, say $q_j$, according to the state transition distribution 
in the current state $q_i$ ($a_{ij}$). 

3. Set $t = t + 1$. 

4. Choose $o_t = v_k$ according to the observation symbol distribution in state 
$q_j$ ($b_j(k)$). 

5. If $t = T$, terminate by taking the transition to the exit state $q_N$ according 
to $a_{jN}$ and set $t = T + 1$. If $t < T$, go to 2. 

Setting $t = T + 1$ might seem strange when the observation sequence is only 
of length $T$. It is, however, theoretically unpleasant that $X_T$ can assume two 
different values, so in the algorithmic descriptions below $X_{T+1}$ will occasionally 
appear and its value will always be $q_N$. In the implementation of the algorithms 
this fix is not necessary. 
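
The generative procedure above can be sketched in a few lines of Python. This is 
a minimal illustration under stated assumptions, not the implementation used in 
this thesis: states are numbered from 0 (entry) to N-1 (exit), `A` is stored as a 
full N x N list of lists with zeros for forbidden transitions, `B` holds one symbol 
distribution per emitting state, and the toy parameters are invented. Unlike the 
description above, where $T$ is fixed in advance, the sketch draws the exit 
transition like any other transition, so the sequence length is itself random. 

```python
import random

# Toy model (hypothetical numbers): state 0 = entry, 1-2 = emitting, 3 = exit.
A = [
    [0.0, 0.6, 0.4, 0.0],  # transitions out of the entry state
    [0.0, 0.5, 0.3, 0.2],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 0.0],  # the exit state is absorbing
]
B = [None, {"a": 0.8, "b": 0.2}, {"a": 0.1, "b": 0.9}, None]


def generate(A, B, rng):
    """Sample one observation sequence by walking from entry to exit."""
    exit_state = len(A) - 1
    state, observations = 0, []
    while True:
        # Draw the next state from row `state` of the transition matrix.
        state = rng.choices(range(len(A)), weights=A[state])[0]
        if state == exit_state:
            return observations
        # Draw a symbol from the observation distribution of the new state.
        symbols = list(B[state])
        weights = [B[state][s] for s in symbols]
        observations.append(rng.choices(symbols, weights=weights)[0])


print(generate(A, B, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible. 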

Given an HMM $\mathcal{M} = (A, B)$ and an observation sequence $O = o_1, o_2, \ldots, o_T$, 
three important tasks can be performed. 

Task 1. Calculate the probability of the observation sequence $O$ given the model 
$\mathcal{M}$, i.e. $P(O \mid \mathcal{M})$. This can be done with the Forward-Backward procedure. 

Task 2. Calculate the joint probability of the observation sequence $O$ and the 
optimal state sequence $Q^*$ given the model $\mathcal{M}$, i.e. $P(O, Q^* \mid \mathcal{M})$. The 
Viterbi algorithm computes this probability. 

Task 3. Reestimate the model parameters $A$ and $B$ so as to maximize the 
probability of a given observation sequence, the training material. The training 
procedure is called the Baum-Welch reestimation algorithm. 



¹ In the speech application the notion of time refers to actual time; it has to do with sample 
rates and suchlike. In the text case, time is merely a metaphor. The discrete time points 
should be thought of as an ordering of events. 

To achieve Task 1 in an efficient way we need the forward variables and the 
backward variables. These variables are used to store intermediate results of 
the forward-backward algorithm. Actually, either the forward variables or the 
backward variables alone would suffice to calculate $P(O \mid \mathcal{M})$, but since we 
need both sets later in the training procedure, both sets are defined here. The 
forward variables $\alpha_t(i)$ hold the probability of being in state $q_i$ at time $t$ 
having observed the partial sequence $o_1, \ldots, o_t$. That is 

$$\alpha_t(i) = P(o_1, \ldots, o_t, X_t = q_i \mid \mathcal{M})$$

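The forward variables are recursively defined: 

Initialization 

$1 \le i \le N$ 

$$\alpha_0(i) = \begin{cases} 1 & \text{if } i = 1 \\ 0 & \text{otherwise} \end{cases} \tag{5.1}$$

Induction 

$0 \le t \le T-1$, $2 \le j \le N-1$ 

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N-1} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1}) \tag{5.2}$$

Termination 

$$P(O \mid \mathcal{M}) = \sum_{i=1}^{N-1} \alpha_T(i)\, a_{iN} \tag{5.3}$$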
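
As an illustration, the forward recursion (5.1)-(5.3) can be sketched as follows. 
The sketch rests on assumptions that are not part of the text: states are 
numbered from 0 (entry) to N-1 (exit), `A` is a full N x N list of lists with 
zeros for forbidden transitions, and the toy parameters are invented. Only the 
current column of the $\alpha$-matrix is kept, which is enough for Task 1. 

```python
# Toy model (hypothetical numbers): state 0 = entry, 1-2 = emitting, 3 = exit.
A = [
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.5, 0.3, 0.2],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 0.0],
]
B = [None, {"a": 0.8, "b": 0.2}, {"a": 0.1, "b": 0.9}, None]


def forward_probability(A, B, obs):
    """Return P(O | M) following equations (5.1)-(5.3)."""
    n = len(A)
    alpha = [0.0] * n
    alpha[0] = 1.0  # (5.1): all probability mass starts in the entry state
    for symbol in obs:  # (5.2): one induction step per observed symbol
        new_alpha = [0.0] * n
        for j in range(1, n - 1):  # emitting states only
            total = sum(alpha[i] * A[i][j] for i in range(n - 1))
            new_alpha[j] = total * B[j][symbol]
        alpha = new_alpha
    # (5.3): take the final transition into the exit state
    return sum(alpha[i] * A[i][n - 1] for i in range(n - 1))


print(forward_probability(A, B, ["a", "b"]))
```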



The purpose behind the entry and exit states should now start to become clearer. 
The vector $a_1$, the transitions out of the entry state, is the equivalent of the 
initial state distribution usually denoted by the vector $\pi$ in the standard model. 
The number $a_{1i}$, for example, gives the probability that the observation sequence 
starts in state $q_i$. The entry state modification is there for practical purposes 
only, which will be explained below. 

The exit state also has a practical purpose, but it also conveys the final state 
idea. The number $a_{iN}$, for example, gives the probability that the observation 
sequence ends in $q_i$. It should be noted that the induction step does not include 
$q_N$ in any way. The transition to the exit state is taken when the entire 
observation sequence has been observed. One can think of this as having some sort 
of end-of-sequence marker ($o_{T+1}$) after the last symbol that triggers the final 
transition to the exit state. $\alpha_{T+1}(N)$ would then equal the summation in the 
termination step and $\alpha_{T+1}(i) = 0$ for all $i \neq N$. There is, however, no point in 
adding the $T+1$ column to the $\alpha$-matrix. (The $\alpha$ variables are typically kept 
in a matrix, as is the case here.) 

The backward variables $\beta_t(i)$ hold the probability of making the partial 
observation $o_{t+1}, \ldots, o_T$ and then taking the transition to the exit state given 
state $q_i$ at time $t$. That is 

$$\beta_t(i) = P(o_{t+1}, \ldots, o_T, X_{T+1} = q_N \mid X_t = q_i, \mathcal{M})$$

Again the backward variables are recursively defined: 

Initialization 

$1 \le i \le N-1$ 

$$\beta_T(i) = a_{iN} \tag{5.4}$$

Induction 

$T-1 \ge t \ge 0$, $1 \le i \le N-1$ 

$$\beta_t(i) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \tag{5.5}$$

One can think of the backward variables as being recursively computed going 
from last to first in the observation sequence. Note that $\beta_T(i)$ means the 
probability of being in the exit state next, given that the present state is $q_i$, i.e. 
$\beta_T(i) = P(X_{T+1} = q_N \mid X_T = q_i) = a_{iN}$. Note also that $\beta_0(1) = P(O \mid \mathcal{M})$. 
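
The backward recursion (5.4)-(5.5) can be sketched in the same way. As before, 
the 0-based state numbering (0 = entry, N-1 = exit), the full N x N layout of 
`A` and the toy parameters are assumptions made for the illustration. The last 
line checks numerically that the backward variable of the entry state at time 0 
equals $P(O \mid \mathcal{M})$. 

```python
# Toy model (hypothetical numbers): state 0 = entry, 1-2 = emitting, 3 = exit.
A = [
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.5, 0.3, 0.2],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 0.0],
]
B = [None, {"a": 0.8, "b": 0.2}, {"a": 0.1, "b": 0.9}, None]


def backward_variables(A, B, obs):
    """Return the matrix beta[t][i] for t = 0..T, following (5.4)-(5.5)."""
    n, T = len(A), len(obs)
    beta = [[0.0] * n for _ in range(T + 1)]
    for i in range(n - 1):
        beta[T][i] = A[i][n - 1]  # (5.4): only the exit transition remains
    for t in range(T - 1, -1, -1):  # (5.5): from last to first
        for i in range(n - 1):
            # obs[t] is o_{t+1} because Python lists are 0-based.
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t]] * beta[t + 1][j] for j in range(1, n - 1)
            )
    return beta


beta = backward_variables(A, B, ["a", "b"])
print(beta[0][0])  # backward variable of the entry state at time 0
```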

The word Hidden in Hidden Markov Model comes from the fact that for any 
given sequence of observation symbols there can be many different underlying 
state sequences and it is impossible to say which one generated the particular 
observation at hand. The state sequence is hidden. There must, however, be 
one state sequence that is at least as likely to have produced the observation as 
any other. That is, given the observation sequence $O = o_1, \ldots, o_T$ there is a 
sequence $Q^* = q_1^*, \ldots, q_T^*$ for which 

$$P(O, Q^* \mid \mathcal{M}) \ge P(O, Q \mid \mathcal{M}) \quad \forall Q \neq Q^*$$

That is, Task 2 above. Enumerating all the $N^T$ possible state sequences and 
choosing the best is obviously not a realistic option. Fortunately there is an 
efficient solution to the problem, the Viterbi algorithm, another dynamic 
programming algorithm. To keep track of the partially optimal sequence en route 
to $P(O, Q^* \mid \mathcal{M})$ we define the set of variables 

$$\phi_t(j) = \max_{q_1, \ldots, q_{t-1}} \left[ P(o_1, \ldots, o_t, q_1, \ldots, q_{t-1}, X_t = q_j \mid \mathcal{M}) \right] \tag{5.6}$$

i.e. $\phi_t(j)$ is the probability of the most likely state sequence that ends in $q_j$ 
and accounts for the first $t$ observations. In order to retrieve the actual 
state sequence a second set of variables $\psi_t(j)$ is needed. The $\psi_t(j)$ keep track 
of the optimal predecessor of state $q_j$ in the path corresponding to $\phi_t(j)$. The 
Viterbi algorithm proceeds as follows: 

Initialization 

$1 \le i \le N$ 

$$\phi_0(i) = \begin{cases} 1 & \text{if } i = 1 \\ 0 & \text{otherwise} \end{cases} \tag{5.7}$$

$$\psi_0(i) = 0 \tag{5.8}$$

Induction 

$1 \le t \le T$, $2 \le j \le N-1$ 

$$\phi_t(j) = \max_{1 \le i \le N-1} \left[ \phi_{t-1}(i)\, a_{ij} \right] b_j(o_t) \tag{5.9}$$

$$\psi_t(j) = \arg\max_{1 \le i \le N-1} \left[ \phi_{t-1}(i)\, a_{ij} \right] \tag{5.10}$$

Termination 

$$P(O, Q^* \mid \mathcal{M}) = \max_{1 \le i \le N-1} \left[ \phi_T(i)\, a_{iN} \right] \tag{5.11}$$

$$q_T^* = \arg\max_{1 \le i \le N-1} \left[ \phi_T(i)\, a_{iN} \right] \tag{5.12}$$

Path Backtracking 

$t = T-1, \ldots, 0$ 

$$q_t^* = \psi_{t+1}(q_{t+1}^*) \tag{5.13}$$

This algorithm is quite similar to the forward-backward algorithm. The only 
real difference is that the summation in the computation of the forward variables 
is replaced by a maximization in the Viterbi algorithm. 
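
A minimal sketch of the Viterbi recursion with entry and exit states is given 
below. It is an illustration only: the 0-based state numbering (0 = entry, 
N-1 = exit), the full N x N layout of `A` and the toy parameters are 
assumptions, and the sketch expects at least one observation. 

```python
# Toy model (hypothetical numbers): state 0 = entry, 1-2 = emitting, 3 = exit.
A = [
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.5, 0.3, 0.2],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 0.0],
]
B = [None, {"a": 0.8, "b": 0.2}, {"a": 0.1, "b": 0.9}, None]


def viterbi(A, B, obs):
    """Return (P(O, Q* | M), Q*) where Q* lists the emitting states visited."""
    n, T = len(A), len(obs)
    phi = [[0.0] * n for _ in range(T + 1)]  # best path probabilities
    psi = [[0] * n for _ in range(T + 1)]    # best predecessors
    phi[0][0] = 1.0                          # initialization
    for t in range(1, T + 1):                # induction
        for j in range(1, n - 1):
            best = max(range(n - 1), key=lambda i: phi[t - 1][i] * A[i][j])
            phi[t][j] = phi[t - 1][best] * A[best][j] * B[j][obs[t - 1]]
            psi[t][j] = best
    # Termination: pick the best state from which to take the exit transition.
    last = max(range(n - 1), key=lambda i: phi[T][i] * A[i][n - 1])
    prob = phi[T][last] * A[last][n - 1]
    # Path backtracking from time T down to time 1.
    path = [last]
    for t in range(T, 1, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return prob, path


print(viterbi(A, B, ["a", "b"]))
```

Replacing `max` by a sum in the induction step turns this back into the forward 
computation, which is the similarity noted above. 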

Recall Task 3: Given a model $\mathcal{M} = (A, B)$ and an observation sequence $O$, 
adjust the parameters of $\mathcal{M}$ so that $P(O \mid \mathcal{M})$ is maximized. This task is not as 
simple as the two previous ones. In fact, there is no known analytical way to 
estimate the model parameters optimally. An iterative procedure such as the 
Baum-Welch reestimation algorithm can, however, be used to find a model that 
locally maximizes $P(O \mid \mathcal{M})$. In each iteration step a new model 
$\bar{\mathcal{M}} = (\bar{A}, \bar{B})$ is estimated from the original model $\mathcal{M} = (A, B)$. It can be 
shown that either the model $\mathcal{M}$ defines a critical point where $\bar{\mathcal{M}} = \mathcal{M}$, or 
$P(O \mid \bar{\mathcal{M}}) > P(O \mid \mathcal{M})$. The iteration process stops when 
$P(O \mid \bar{\mathcal{M}}) - P(O \mid \mathcal{M}) < \epsilon$, for some suitably chosen $\epsilon$. The 
reestimation of $A$ and $B$ can be described as: 



$$\bar{a}_{ij} = \frac{\text{expected number of transitions from } q_i \text{ to } q_j}{\text{expected number of transitions from } q_i} \tag{5.14}$$

$$\bar{b}_j(k) = \frac{\text{expected number of times in } q_j \text{ observing symbol } v_k}{\text{expected number of times in } q_j} \tag{5.15}$$

Recall the definition of the $\alpha$ variables 

$$\alpha_t(i) = P(o_1, \ldots, o_t, X_t = q_i \mid \mathcal{M})$$

and the $\beta$ variables 

$$\beta_t(i) = P(o_{t+1}, \ldots, o_T, X_{T+1} = q_N \mid X_t = q_i, \mathcal{M})$$

It is not by pure chance that the $\alpha$ variables and $\beta$ variables fit so 
nicely together. Their product gives 

$$\frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \mathcal{M})} = \frac{P(o_1, \ldots, o_T, X_t = q_i, X_{T+1} = q_N \mid \mathcal{M})}{P(O \mid \mathcal{M})} = P(X_t = q_i, X_{T+1} = q_N \mid O, \mathcal{M}) \tag{5.16}$$

i.e. the probability of being in state $q_i$ at time $t$ and finally ending up in the exit 
state, conditioned on the observation sequence and the model. Summed over 
the time index $t$, the expression in (5.16) can be interpreted as the expected 
(over time) number of visits to state $q_i$, or equivalently, the expected number of 
transitions from state $q_i$. That is 

$$\frac{1}{P(O \mid \mathcal{M})} \sum_{t=0}^{T} \alpha_t(i)\,\beta_t(i) = \text{expected number of times in } q_i \tag{5.17}$$

yields one of the sought-for quantities. In a similar way it is the case that 

$$\frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \mathcal{M})} = P(X_t = q_i, X_{t+1} = q_j, X_{T+1} = q_N \mid O, \mathcal{M}) \tag{5.18}$$

i.e. the probability of being in state $q_i$ at time $t$ and $q_j$ at time $t + 1$ and being 
able to make the exit state transition given the observations and the model. 
And analogously 

$$\frac{1}{P(O \mid \mathcal{M})} \sum_{t=0}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) = \text{expected number of transitions from } q_i \text{ to } q_j \tag{5.19}$$

can be interpreted as the expected number of transitions from $q_i$ to $q_j$ over time. 
Note that the exit state transition at time $T$ has to be dealt with separately 
(see equation 5.21 below). 

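Using the above equations, a convincing set of reestimation formulas would 
be: 

$1 \le i \le N-1$, $2 \le j \le N-1$ 

$$\bar{a}_{ij} = \frac{\sum_{t=0}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{t=0}^{T} \alpha_t(i)\,\beta_t(i)} \tag{5.20}$$

$1 \le i \le N-1$ 

$$\bar{a}_{iN} = \frac{\alpha_T(i)\, a_{iN}}{\sum_{t=0}^{T} \alpha_t(i)\,\beta_t(i)} \tag{5.21}$$

$2 \le j \le N-1$, $1 \le k \le K$ 

$$\bar{b}_j(k) = \frac{\sum_{t=1,\; o_t = v_k}^{T} \alpha_t(j)\,\beta_t(j)}{\sum_{t=0}^{T} \alpha_t(j)\,\beta_t(j)} \tag{5.22}$$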

The major difference in the reestimation formulas compared to the standard 
model is the formula (5.21), the reestimation of the exit state transitions. The 
exit state transition can only occur when time $t = T$, thus there is no summation 
over time in the numerator of (5.21). The effect of the exit state transition at 
time $t = T$ is also that the normalization of $\bar{a}_{ij}$ is slightly different from the 
standard model. In the standard model the summation in the denominator goes 
from $t = 0, \ldots, T-1$. The reestimation formula (5.22) is the same as for the 
standard model. 
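
One reestimation step built from the quantities above can be sketched as 
follows. It is a sketch under assumptions, not the training code used in this 
thesis: states are numbered from 0 (entry) to N-1 (exit), `A` is a full N x N 
list of lists, a single observation sequence is used, no scaling is applied, and 
the toy parameters are invented. The common factor $1/P(O \mid \mathcal{M})$ cancels in 
every ratio and is therefore left out. 

```python
# Toy model (hypothetical numbers): state 0 = entry, 1-2 = emitting, 3 = exit.
A = [
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.5, 0.3, 0.2],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 0.0],
]
B = [None, {"a": 0.8, "b": 0.2}, {"a": 0.1, "b": 0.9}, None]
obs = ["a", "b", "b", "a"]


def forward(A, B, obs):
    n, T = len(A), len(obs)
    alpha = [[0.0] * n for _ in range(T + 1)]
    alpha[0][0] = 1.0
    for t in range(T):
        for j in range(1, n - 1):
            total = sum(alpha[t][i] * A[i][j] for i in range(n - 1))
            alpha[t + 1][j] = total * B[j][obs[t]]
    return alpha


def backward(A, B, obs):
    n, T = len(A), len(obs)
    beta = [[0.0] * n for _ in range(T + 1)]
    for i in range(n - 1):
        beta[T][i] = A[i][n - 1]
    for t in range(T - 1, -1, -1):
        for i in range(n - 1):
            beta[t][i] = sum(A[i][j] * B[j][obs[t]] * beta[t + 1][j]
                             for j in range(1, n - 1))
    return beta


def reestimate(A, B, obs):
    """One Baum-Welch step; returns (new_A, new_B, P(O | old model))."""
    n, T = len(A), len(obs)
    alpha, beta = forward(A, B, obs), backward(A, B, obs)
    new_A = [list(row) for row in A]
    new_B = [dict(b) if b else None for b in B]
    for i in range(n - 1):
        # Shared denominator of (5.20)-(5.22).
        visits = sum(alpha[t][i] * beta[t][i] for t in range(T + 1))
        if visits == 0.0:
            continue  # state never visited: keep its old parameters
        for j in range(1, n - 1):  # (5.20)
            new_A[i][j] = sum(alpha[t][i] * A[i][j] * B[j][obs[t]] * beta[t + 1][j]
                              for t in range(T)) / visits
        new_A[i][n - 1] = alpha[T][i] * A[i][n - 1] / visits  # (5.21)
        if i >= 1:  # (5.22), emitting states only
            new_B[i] = {v: 0.0 for v in B[i]}
            for t in range(1, T + 1):
                new_B[i][obs[t - 1]] += alpha[t][i] * beta[t][i] / visits
    return new_A, new_B, beta[0][0]


model_A, model_B = A, B
for _ in range(5):
    model_A, model_B, likelihood = reestimate(model_A, model_B, obs)
    print(likelihood)  # likelihood of the model before each update
```

The printed likelihoods should not decrease from one iteration to the next, 
which is the property stated for the reestimation procedure above. 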

The primary intent of this section has been to show the small, yet significant, 
difference between this model and the standard model. There are a number 
of practical implementation issues concerning HMMs that are consciously left 
out of this presentation. Some of these will surface briefly in the remainder 
of this thesis, but for a more comprehensive account the reader should consult 
e.g. Rabiner [1989] and Levinson et al. [1983] on the topics of scaling, multiple 
observation sequences, choice of model type (topology) and size and initial pa- 
rameter estimation. The problem of sparse data has been addressed in a number 
of papers, including Jelinek and Mercer [1980], Church and Gale [1991a] and 
Katz [1987]. 



5.2 Token Passing 

The term Token Passing originates with the researchers engaged in the speech 
recognition effort at the Engineering Department at Cambridge, UK [Young et 
al., 1989]. The problem of speech recognition can simplistically be described as 
a recurring process of grouping sequential items to form more meaningful items. 
The speech waveform is sampled and processed for feature vectors, these are 
segmented into a sequence of phones, phones are grouped into words and from 
there, depending on the application of course, some sort of language model deals 
with the phrases and/or sentence(s). There are obviously interdependencies 
between the different levels of this hierarchically laid out recognition process. The 
predictability of a certain phone sequence, for example, is of course dependent 
on what words are likely to occur in the particular context at hand. The 
recognition problem of the levels varies in difficulty, and the complexity of the 
problem is partly due to the strength of the dependencies on neighboring levels: 
the weaker the dependencies, the harder the problem. Recognizing a sequence 
of phones as a word is considerably easier than finding phones in the feature 
vector stream. By transitivity one can say that all levels are dependent on each 
other. The main advantage of the Token Passing framework/algorithm (TP) 
is the elegant way in which dependencies between knowledge sources are 
maintained, i.e. the coupling of neighboring layers. Another feature is the flexibility 
of the framework: alternative recognition algorithms can be used in various 
layers. Furthermore, it can quite easily be adapted to generate multiple alternative 
solutions and to incorporate search heuristics. 

In speech recognition as well as in text recognition there is a basic recog- 
nition unit. In the hierarchical speech recognition scheme laid out above the 
recognition of a phone as a sequence of feature vectors is the basic recognition 
task, hence the phone is the basic recognition unit. In the textual case one 
can have morphemes, lemmas or word-forms as the basic recognition units. We 
assume from here on that the word-form is the basic recognition unit. Each 
basic recognition unit is represented by a finite state network, and each network, 
or model, holds a model identifier, the word-form string that it models. Each 
pair of states $i$, $j$ of the network is connected by a transition cost $p_{ij}$, and with 
each state $j$ there is an associated local cost function $d_j(c)$. A path 
$I = i_1, \ldots, i_T$ through the network represents one possible alignment of the 
network to the input characters $C = c_1, \ldots, c_T$. Assume, for the time being, 
that $C$ is a word form. 

Let each state of the network be capable of holding a passable token. At time 
$t$ the token in state $j$ holds the (partial) minimum cost alignment of $c_1, \ldots, c_t$ 
and the network that ends in state $j$, i.e. the token represents the head of a 
path through the network. The minimum cost alignment $\delta_t(j)$ can be computed 
by the recursion 

$$\delta_t(j) = \min_i \left[ \delta_{t-1}(i) + p_{ij} \right] + d_j(c_t) \tag{5.23}$$

At each discrete time point copies of tokens are propagated between states and 
the minimum cost alignment is updated according to equation (5.23). The 
algorithm is illustrated in Figure 5.1. The token is the box containing the cost 
$\delta_t$ inside each state. 
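
The recursion (5.23) can be sketched as a small token propagation loop. The 
sketch makes several assumptions that are not part of the text: a token is 
reduced to a bare cost, state 0 is taken to be a start state holding a zero-cost 
token, transition costs are given as a dictionary, and the two-letter example 
network with its costs is invented. 

```python
import math


def token_passing(n_states, trans_cost, local_cost, chars):
    """Return the minimum alignment cost held in each state after `chars`.

    trans_cost maps (i, j) to the transition cost p_ij; missing pairs are
    treated as unconnected. local_cost(j, c) is the local cost d_j(c).
    """
    tokens = [math.inf] * n_states
    tokens[0] = 0.0  # assumed start token in state 0
    for c in chars:
        new_tokens = [math.inf] * n_states
        for (i, j), p in trans_cost.items():
            # Pass a copy of the token in state i to state j, keep the cheapest.
            candidate = tokens[i] + p + local_cost(j, c)
            if candidate < new_tokens[j]:
                new_tokens[j] = candidate
        tokens = new_tokens  # the original tokens are discarded
    return tokens


# Hypothetical network for the word "ab": state 1 expects 'a', state 2 expects 'b'.
expected = {1: "a", 2: "b"}
trans_cost = {(0, 1): 0.0, (1, 1): 1.0, (1, 2): 0.0, (2, 2): 1.0}


def local_cost(j, c):
    return 0.0 if expected[j] == c else 1.0


print(token_passing(3, trans_cost, local_cost, "ab")[2])
print(token_passing(3, trans_cost, local_cost, "aab")[2])
```

Changing `trans_cost` and `local_cost` changes what is computed without 
touching the propagation loop itself. 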

Figure 5.1: Token propagation 

The attentive reader has probably noticed the similarity between 
equation (5.23) and the recursion formula (5.9) of the Viterbi algorithm above. The 
similarity is, of course, not accidental. This, however, should not lead to the 
conclusion that the TP algorithm is just an unorthodox way of implementing 
the Viterbi algorithm. It is more general than that, or more accurately, the 
passing of tokens within the basic recognition unit, equation (5.23), can be seen 
as a generalization of a number of other 'cost functions', e.g. many of the 
distance metrics used in isolated word error correction. Consider, for example, the 
Weighted Levenshtein Distance (WLD) metric [Okuda et al., 1976] mentioned 
in Chapter 4. The WLD measures the distance between two words $X$ and $Y$ as 
the number of substitutions ($k_i$), insertions ($m_i$) and deletions ($n_i$) with weights 
$p$, $q$ and $r$ attached to the respective error categories. Okuda et al. express this 
as 

$$\mathrm{WLD}(X \to Y) = \min_i \left[ p k_i + q m_i + r n_i \right] \tag{5.24}$$

i.e. the minimum number of weighted errors it takes to transform the word $X$ 
into $Y$. To calculate (5.24) Okuda et al.² devised a recursive algorithm that 
operates on substrings of increasing length of the two words, where $\mathrm{WLD}_t(j)$ 
denotes $\mathrm{WLD}(x_1, \ldots, x_j \to y_1, \ldots, y_t)$. 



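$$\mathrm{WLD}_t(j) = \min \begin{cases} \mathrm{WLD}_t(j-1) + q \\ \mathrm{WLD}_{t-1}(j-1) + p_{j-1,t-1} \\ \mathrm{WLD}_{t-1}(j) + r \end{cases} \tag{5.25}$$

where 

$$p_{j-1,t-1} = \begin{cases} 0 & \text{if } y_{t-1} = x_{j-1} \\ p & \text{otherwise} \end{cases}$$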



To adapt the algorithm to TP it is necessary to take a slightly different view. 
The Okuda et al. algorithm compares two strings. Here we have a 
network and a string and it is not appropriate to change state without consuming 
anything from the input, as is done in the topmost of the three expressions 
in (5.25). (The expression adds the penalty for insertion to the metric.) The 
expression (5.25) above has to be reformulated to suit the TP algorithm. If the 
word $X$ is represented by a network in which each state $j$ corresponds to $x_j$, the 
$j$:th character in $X$, and $Y = y_1, \ldots, y_t$ is the transformed word, the algorithm 
presented by Okuda et al. can be reformulated as 

$$\mathrm{WLD}_t(j) = \min_i \begin{cases} \mathrm{WLD}_{t-1}(i) + q & (\text{if } x_j = x_{ins}) \\ \mathrm{WLD}_{t-1}(i) + \begin{cases} 0 & \text{if } y_t = x_j \\ p & \text{otherwise} \end{cases} \\ \mathrm{WLD}_{t-1}(i) + (j-i-1)\, r & j - i \ge 2 \end{cases} \tag{5.26}$$

² The notation used here is somewhat different from that used by Okuda et al. 



Figure 5.2: Computation of weighted Levenshtein distance with Token Passing 

The algorithm for computing the WLD presented by Okuda et al. can be 
realized by (5.26) and the network topology of Figure 5.2. The word represented 
by the network in the figure is 'abcde'³. The dedicated 'insert' states are there 
to make sure that inserted characters are penalized equally hard independent of 
which character is inserted and in which position. The insert state is denoted 
$x_{ins}$ in (5.26). The relation between equations (5.23) and (5.26) is simply such 
that the transition cost $p_{ij}$ implements the insertion weight ($q$) and the deletion 
weight ($r$) whereas the local cost function $d_j(c_t)$ implements the substitution 
weight. 
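
For comparison, the plain string-to-string computation of a weighted 
Levenshtein distance can be sketched as a standard dynamic programming table. 
This is not the token passing network of Figure 5.2 and not the exact 
formulation of Okuda et al.; the function name and the naming convention (a 
character present in $Y$ but not in $X$ counts as an insertion, a character of $X$ 
missing from $Y$ as a deletion) are assumptions of the sketch. 

```python
def weighted_levenshtein(x, y, p=1.0, q=1.0, r=1.0):
    """Cost of transforming x into y.

    p: weight of a substitution
    q: weight of an insertion (character in y with no counterpart in x)
    r: weight of a deletion (character in x with no counterpart in y)
    """
    # Row for the empty prefix of y: delete every character of x seen so far.
    prev = [j * r for j in range(len(x) + 1)]
    for t in range(1, len(y) + 1):
        cur = [t * q]  # empty prefix of x: insert every character of y
        for j in range(1, len(x) + 1):
            sub = 0.0 if x[j - 1] == y[t - 1] else p
            cur.append(min(
                prev[j - 1] + sub,  # match or substitution
                prev[j] + q,        # y[t-1] was inserted
                cur[j - 1] + r,     # x[j-1] was deleted
            ))
        prev = cur
    return prev[len(x)]


print(weighted_levenshtein("abcde", "abde"))
print(weighted_levenshtein("show", "shwo"))
```

With unit weights a transposition such as 'shwo' is charged as two errors, which 
is one reason why transpositions get special attention in Section 5.3. 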

One of the fundamental ideas of TP is to separate out the low-level 
pattern-matching algorithm(s) of the basic recognition units from the higher level control 
mechanisms. In the case of textual input the low-level task is to recognize words 
in the character input stream. The hypothesizing of word occurrences in the 
input is controlled by the higher level, the language model. There are many 
ways in which the pattern matching function can be controlled, or guided. These 
issues will be discussed further in Section 5.4 and Chapter 6; the account below 
focuses on the TP component that facilitates language modeling. 

³ The model is approximate in the sense that it does not properly cope with insertions and 
deletions at the end-points of the modeled word. This can easily be fixed with entry and exit 
states. The point is just to show that TP can be used for a variety of computational tasks 
and to a large extent it is only a matter of altering the network topology. 

If the character input stream $C = c_1, \ldots, c_T$ contains several words, there 
must be a way that allows for tokens to be passed from one basic recognition 
unit to another. This can be accomplished simply by connecting subnetworks to 
form larger composite networks. In doing so, however, it is necessary to record 
transitions between subnetworks in some way. We are interested in the actual 
word sequence, not just the cost of the best alignment of states to the input. 
To keep track of what word models a token has passed through on its way to 
the end of the input, the token is supplemented with a path identifier. The 
path identifier is a pointer to a Word Link Record (WLR) that contains word 
boundary information. 

As a token is propagated from one subnetwork to another, a new WLR is 
created and the token is set to point to the new WLR, which in turn is set to 
point to the WLR that the token was pointing to prior to the inter-network token 
propagation. Figure 5.3 visualizes the process. The amount of information put 
in the WLR can depend on the language model used (see below); minimally it 
should contain the cost, path identifier and the model identifier. In Figure 5.3 
the time (character stream position) at which the word boundary occurred is 
also appended to the WLR. 
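
The bookkeeping described above can be sketched with two small record types. 
The class names, field names and the helper functions are hypothetical and only 
mirror the description: a token carries a cost and a path identifier, and each 
Word Link Record stores the cost, the model identifier, the boundary time and a 
pointer to the previous record. 

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WordLinkRecord:
    cost: float                            # cost of the token at the boundary
    model_id: str                          # word-form of the network just left
    time: int                              # character position of the boundary
    previous: Optional["WordLinkRecord"]   # record the token pointed to before


@dataclass
class Token:
    cost: float
    path: Optional[WordLinkRecord] = None  # path identifier


def cross_word_boundary(token, model_id, time):
    """Return the token that is propagated into the next subnetwork."""
    record = WordLinkRecord(token.cost, model_id, time, token.path)
    return Token(token.cost, record)


def word_sequence(token):
    """Follow the chain of records back to recover the recognized words."""
    words, record = [], token.path
    while record is not None:
        words.append(record.model_id)
        record = record.previous
    return words[::-1]


token = Token(cost=0.0)
token = cross_word_boundary(Token(token.cost + 1.5, token.path), "the", 3)
token = cross_word_boundary(Token(token.cost + 0.7, token.path), "cat", 6)
print(word_sequence(token))
```

Because records are immutable and only point backwards, many tokens can share 
the same history without copying it. 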



Figure 5.3: Inter-network token propagation 



The time index $t$ is not updated after the transition, which means that 
identifying the word boundary does not imply the consumption of any character from 
the input stream. This could equally well be the other way around. The details 
of exactly how, and under what circumstances, tokens are passed between the 
basic recognition units are left to Section 5.4. 

5.3 Isolated Word Recognition 

The problem of Isolated Word Recognition (or isolated word error correction) 
is to detect and correct erroneous words without the use of any contextual de- 
pendencies. A word is viewed in isolation. Normally words do not just appear 
individually, they most often appear in a text, a stream of characters. The iden- 
tification of words in such a stream is called tokenization, and Isolated Word 
Recognition (IWR), to be successful, relies on correct tokenization. A conse- 
quence of this implicit assumption, that is generally not spelled out regarding 
IWR, is that an erroneous token is assumed to be the misspelling of exactly 
one word. Under these conditions it is only the nonword misspellings (single or 
multiple) that can be corrected using IWR. 

The classical scheme used to correct misspellings in isolation involves three 
steps: 

Step 1. Detect the erroneous token 

Step 2. Generate alternative correction candidates 

Step 3. Rank the alternative candidates 

The detection step (almost) always means comparing the token with the words in 
a dictionary. If the token is not equal to any of the entries in the dictionary, it is 
taken to be misspelled. The generation of candidates can be performed in a number of ways 
and it is often interleaved with the ranking process which often involves some 
sort of distance metric. The dictionary word that is closest to the error-token, 
using the distance metric, is the highest ranked candidate and should be chosen 
for correction (cf. Section 4.1). 

The remainder of this section will describe how the Hidden Markov Model 
can be used to perform Isolated Word Recognition within the Token Passing 
framework. In the approach taken here the three steps are merged into a 
single process: detection, generation and ranking of candidates are simply a 
matter of computing the probability of an error token given the possible words 
in the dictionary. The noisy channel is an illustrative metaphor when using 
probabilistic methods. 



Figure 5.4: The noisy channel 

A word is inserted at one end of the channel and from the other end comes 
a distorted version of that word. The aim is to restore the original. 

Let $W$ be the $M$ word vocabulary $W = \{w_1, \ldots, w_M\}$. Given the character 
sequence $C = c_1, \ldots, c_T$, which may be erroneous or not, the most likely 
correction is the $w_i$ that maximizes $P(w_i \mid C)$, $1 \le i \le M$. This number is hard to 
calculate, but fortunately there is Bayes' rule 

$$P(w_i \mid C) = \frac{P(C \mid w_i)\, P(w_i)}{P(C)} \tag{5.27}$$

Choosing the word that best matches the character sequence is not dependent on 
the probability of the sequence, which is given, so it suffices to find the $w_i$ that 
maximizes the numerator in Bayes' rule. Modeling each word $w_i$ of 
the vocabulary with an HMM $\mathcal{M}_{w_i}$ and making the obviously faulty assumption 
that all words are equiprobable, finding the word $w^*$ that best matches the 
character sequence $C$ is simply 

$$w^* = \arg\max_{w_i} \left[ P(C \mid \mathcal{M}_{w_i}) \right] \tag{5.28}$$

It should be noted that there is no error detection being performed here in the 
normal sense of the word. It might very well be the case that $w_i = C$ for some 
$i$, i.e. $C$ is not misspelled at all. It would of course suffice to do a simple 
string match between $C$ and the strings of the vocabulary words to find this 
out. The scheme described here presupposes Step 1 above, or, one can think of 
the detection of the error as something that is found out after the ranking of 
the candidates. If the highest scoring candidate $w^* \neq C$, then a spelling error 
has been detected (and corrected at the same time). 

The number $P(C \mid \mathcal{M}_{w_i})$ can be efficiently computed, as shown in Section 5.1, 
but how should the words be modeled using HMMs? 

The type of model used here is the so-called left-to-right model, for which 

$$a_{ij} = 0 \quad \text{when } j < i$$

Often additional constraints are placed on the left-to-right model, such as 

$$a_{ij} = 0 \quad \text{when } j > i + \Delta$$

to make sure that large changes in the state indices do not occur. In the 
left-to-right model of Figure 5.5 $\Delta$ is 2. 

Figure 5.5: The structure of $\mathcal{M}_{show}$ 

The solid arrows represent transitions with non-zero probabilities. The 
dashed arrows indicate what this particular model is biased towards. State 2, 
for example, can have non-zero probabilities for all observables, but is strongly 
biased towards s, i.e. $b_2(k)$ has the highest probability for $v_k$ = s. 

The general idea is that the states of the model represent character positions 
in the word modeled. Recall the four basic error types: character insertion, 
deletion, substitution and transposition. The model above can deal with all four, 
although perhaps not ideally with transpositions. The transition distribution $A$ 
makes it possible to handle deletions and insertions. The character distribution 
$B$ makes it possible to handle substitutions and for transpositions a combination 
of both is needed. The model of Figure 5.5 is not necessarily the best choice; 
some of the restrictions imposed on it can seem a bit strange. There may be 
any number of insertions (the looping arcs), but more than one consecutive 
deletion will not be very well handled since $\Delta = 2$. If there were a backward 
chaining arc from each state to the preceding state, transpositions would be more 
easily recognized. A character sequence like 'shwo' could then be generated 
(recognized) with the state sequence $1 \to 2 \to 3 \to 5 \to 4 \to 6$ which would 
then score higher than the sequence $1 \to 2 \to 3 \to 5 \to 5 \to 6$ which will 
probably be the highest scoring sequence using the model in Figure 5.5. Relaxing 
these constraints, however, would lead to a close to ergodic model, and the 
computational advantages of the left-to-right model would be lost. Furthermore, 
restrictions on the number of errors that can occur in a single word are supported 
by findings in samples of spelling errors (cf. Damerau [1964]). 
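
To make the scheme concrete, the sketch below builds one left-to-right model 
per vocabulary word and picks the word whose model gives the input the highest 
probability. Everything specific in it is an assumption for illustration: the 
parameters are set by hand rather than trained, the probabilities 0.9 and 0.05 
are invented, the alphabet is restricted to lower-case a-z, and states are 
numbered from 0 (entry) to N-1 (exit). 

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def word_model(word, hit=0.9, loop=0.05, step=0.9, skip=0.05):
    """Hand-set left-to-right HMM with a maximal forward jump of two states."""
    n = len(word) + 2  # entry state + one state per character + exit state
    A = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        arcs = {i + 1: step}          # move on to the next character
        if i >= 1:
            arcs[i] = loop            # looping arc: absorbs an inserted character
        if i + 2 <= n - 1:
            arcs[i + 2] = skip        # skipping a state models a deletion
        total = sum(arcs.values())
        for j, weight in arcs.items():
            A[i][j] = weight / total  # normalize the outgoing arcs
    miss = (1.0 - hit) / (len(ALPHABET) - 1)
    B = [None]
    B += [{c: (hit if c == ch else miss) for c in ALPHABET} for ch in word]
    B += [None]
    return A, B


def forward_probability(A, B, chars):
    """P(C | M) computed with the forward variables."""
    n = len(A)
    alpha = [0.0] * n
    alpha[0] = 1.0
    for c in chars:
        alpha = [0.0] + [
            sum(alpha[i] * A[i][j] for i in range(n - 1)) * B[j][c]
            for j in range(1, n - 1)
        ] + [0.0]
    return sum(alpha[i] * A[i][n - 1] for i in range(n - 1))


def correct(token, vocabulary):
    """Return the vocabulary word whose model best explains the token."""
    return max(vocabulary, key=lambda w: forward_probability(*word_model(w), token))


vocabulary = ["show", "hello", "word"]
print(correct("shwo", vocabulary))
print(correct("helo", vocabulary))
```

In this hand-set model the transposition in 'shwo' is explained by a skip 
followed by a looping arc, which corresponds to the second state sequence 
discussed above. 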

The quantity $P(C \mid \mathcal{M}_{w_i})$ can be thought of as a measure of similarity 
between the string $w_i$, modeled by $\mathcal{M}_{w_i}$, and the string $C$. Angell et al. [1983] 
discuss alternative string similarity measures and distinguish three types: ma- 
terial, ordinal and positional similarity. Material similarity measures the extent 
to which a pair of strings contains identical characters, ordinal similarity mea- 
sures the extent to which the characters are in the same order, and positional 
similarity measures the extent to which the characters are in corresponding po- 
sitions in the two strings. Most of the classical techniques use one or sometimes 
two of these similarity measures to perform isolated word error correction. It is 
interesting to see that the approach presented here is actually a mixture of all 
three types. The material similarity is expressed by the probability to observe 
a particular character, and the fact that this probability is conditioned on the 
states also makes it a positional similarity measure. The transition probabilities 
between states capture the ordinal similarity. 

The probability that a particular model generated a given character sequence 
is, of course, dependent on how the model has been trained and also what the 
initial (pre-training) model looked like. Ideally one would like to have a large 
set of naturally occurring misspellings for each word in the vocabulary to train 
the respective models with. Such an error corpus is unfortunately not available 
so the corpus has to be artificially generated. These issues will be described in 
some detail in the sections below reporting the experiments. 

In computing the probability of the character string for all the $\mathcal{M}_{w_i}$ in the 
vocabulary, it is not really the number $P(C \mid \mathcal{M}_{w_i})$ as such that is interesting; 
it is the probability relative to the other words in the vocabulary that matters. 
To this end the Viterbi algorithm computes a good enough approximation 
of the probability of a string⁴. When the length of the character sequence 
grows, an arithmetic underflow condition can occur in the forward-backward 
algorithm, i.e. as $t \to \infty$, $\alpha_t(i) \to 0$. In the Viterbi algorithm, since it maximizes 
rather than sums (see equations (5.2) and (5.9)), logarithms of the probabilities 
can be used and underflow conditions do not arise. The logarithms of the 
probabilities can, of course, be computed beforehand and do not constitute 
an increased computational load in the actual recognition task. Instead of 
maximizing the logarithm of the probability of a path (state sequence), one can 
of course minimize the negative logarithm of the probability of the path, the cost 
of the state sequence. Let $\delta_t(j)$ be the counterpart of $\phi_t(j)$ in equation (5.6) 
and denote the cost of the cheapest state sequence that ends in $q_j$ and accounts 
for the first $t$ characters. The reformulation of the Viterbi algorithm is straightforward: 



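Initialization, $1 \le i \le N$:

\[ \delta_0(i) = \begin{cases} 0 & \text{if } i = 1 \\ \infty & \text{otherwise} \end{cases} \tag{5.29} \]

Induction, $1 \le t \le T$, $2 \le j \le N-1$:

\[ \delta_t(j) = \min_{1 \le i \le N-1} \left[ \delta_{t-1}(i) + (-\log a_{ij}) \right] + (-\log b_j(c_t)) \tag{5.30} \]

Termination:

\[ -\log P(C, Q^*|\mathcal{M}) = \min_{1 \le i \le N-1} \left[ \delta_T(i) + (-\log a_{iN}) \right] \tag{5.31} \]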



The adaption of the Viterbi algorithm to Token Passing is quite trivial. 
Equations (5.30) and (5.23) are virtually identical. The transition cost is the 
negative logarithm of the transition distribution of the HMM and the local cost 
function is the negative logarithm of the observation symbol distribution. 
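
As an illustration, the cost formulation in (5.29) to (5.31) could be sketched
in Python as follows. This is a minimal sketch and not the implementation used
in the thesis: the names `neg_log` and `viterbi_cost`, the representation of
the model as a transition matrix plus one emission dictionary per state, and
the 0-based state numbering are all assumptions made for the example.

```python
import math

INF = math.inf


def neg_log(p):
    """Cost of a probability: -log p, with -log 0 taken as infinity."""
    return INF if p == 0.0 else -math.log(p)


def viterbi_cost(a, b, chars):
    """Return the cost -log P(C, Q* | M) of the cheapest path through one HMM.

    a[i][j] : transition probability from state i to state j
    b[j]    : dict mapping a character to its emission probability in state j
    States are numbered 0 .. N-1 here (1 .. N in the text); state 0 is the
    non-emitting entry state and state N-1 the non-emitting exit state.
    """
    n = len(a)
    # Initialization (5.29): zero cost in the entry state, infinity elsewhere.
    delta = [0.0] + [INF] * (n - 1)
    for c in chars:
        new = [INF] * n
        # Induction (5.30): only the emitting states receive tokens.
        for j in range(1, n - 1):
            best = min(delta[i] + neg_log(a[i][j]) for i in range(n - 1))
            new[j] = best + neg_log(b[j].get(c, 0.0))
        delta = new
    # Termination (5.31): cheapest transition into the exit state.
    return min(delta[i] + neg_log(a[i][n - 1]) for i in range(n - 1))
```

Because costs are added rather than probabilities multiplied, long inputs do
not cause underflow; an impossible path simply ends up with infinite cost.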

Algorithm 1 below describes how the Viterbi algorithm is computed with 
TP in a single HMM network. The algorithm works for any network topology, 
but it might be useful to keep the HMM of Figure 5.5 in mind. 

The HMM network has N states numbered 1 to N, where state 1 is the
entry state and state N is the exit state. Each token $\tau$ contains only the
cost, as computed by (5.30). Let $\tau_i(\delta_t)$ denote the cost of the
token in state $i$ at time $t$. The start token has cost $-\log 1 = 0$. The
null token has cost $-\log 0 = \infty$. The input character sequence is
$C = c_1, \ldots, c_T$.

4 Thus $P(C, Q^*|\mathcal{M}_{w_i})$ is used as an approximation for
$P(C|\mathcal{M}_{w_i})$. The Viterbi algorithm is the algorithm generally
used in applications performing some sort of recognition with HMMs.

Algorithm 1 Viterbi decoding using Token Passing with a single HMM network.
The body of the main loop (boxed in the original) is the step_model($c_t$)
procedure that is reused in Algorithms 2 and 4 below.

 1: At time t = 0
 2: Put start token in the entry state
 3: Put null tokens in all other states
 4: for t = 1 to T do
 5:   for all states i < N do
 6:     Pass a copy of the token $\tau_i$ to all connecting states j:
          $\tau_j(\delta_t) = \tau_i(\delta_{t-1}) + (-\log a_{ij}) + (-\log b_j(c_t))$
 7:   end for
 8:   Discard all original tokens
 9:   for all states i < N do
10:     Find the minimum cost token and discard the rest
11:   end for
12:   for all states i connected to state N do
13:     Pass a copy of the token $\tau_i$ to state N:
          $\tau_N(\delta_t) = \tau_i(\delta_t) + (-\log a_{iN})$
14:   end for
15:   In state N: Find the minimum cost token and discard the rest
16: end for



In Isolated Word Recognition it is assumed that the input $c_1, \ldots, c_T$
is exactly one distorted word. This assumption implies that there is no point
in making the exit state transition before $t = T$, i.e. lines 12 and 13 can
be put outside of the main loop in Algorithm 1. The step_model procedure as
displayed above will, however, be reused in a situation in which this
assumption does not hold; thus the exit state transition has to be
hypothesized for each value of $t$, i.e. any character $c_t$ may be the last
character in the word modeled by the basic recognition unit.

When the entire input has been consumed, the cost of the token in the exit
state of the model represents the best alignment of states to the input, i.e.

\[ \mathcal{M}_{w_i} \rightarrow \tau_N(\delta_T) = -\log P(C, Q^*|\mathcal{M}_{w_i}) \]

where $Q^*$ is the optimal path (footnote 5) through the model and $w_i$ is
the word (the model identifier) that is modeled by the network. To perform
Isolated Word Recognition over an M word vocabulary
$W = \{w_1, \ldots, w_M\}$ it is, of course, necessary to execute Algorithm 1
for all the M basic recognition units and then choose $w_i$ as the correction
hypothesis if $\mathcal{M}_{w_i} \rightarrow \tau_N(\delta_T)$ has the lowest
cost.

5 The actual state sequence through the basic recognition unit is not
recorded in any way in Algorithm 1 and it cannot be restored.

During the computation of the step_model procedure, many models will display
significantly different costs. This is, of course, the whole idea. The point,
however, is that these differences will start to show up when only a
relatively small number of input characters have been processed. Consider for
example the input C = 'heuristics', and assume that both
$\mathcal{M}_{heuristics}$ and $\mathcal{M}_{exhaustive}$ are in the
vocabulary. Then

\[ \mathcal{M}_{heuristics} \rightarrow \tau^*(\delta_t) \ll \mathcal{M}_{exhaustive} \rightarrow \tau^*(\delta_t) \]

after just a few characters. ($\tau^*(\delta_t)$ denotes the best token in any
of the states at time $t$.) In a situation like this the Beam Search heuristic
can be useful. All models that are outside the beam are deactivated (pruned).
The beam is defined by the cost of the globally optimal model,
$\mathcal{M}^* \rightarrow \tau^*(\delta_t)$, where

\[ \mathcal{M}^* \rightarrow \tau^*(\delta_t) = \min_{1 \le i \le M} \left[ \mathcal{M}_{w_i} \rightarrow \tau^*(\delta_t) \right] \tag{5.32} \]

(the cost of the minimum cost token of all states of all models), and a preset
threshold B. The overhead of the beam search heuristic is the computation of
(5.32). If at any time $t$ it is the case that the model $\mathcal{M}_{w_i}$
is outside the beam, i.e.

\[ \mathcal{M}_{w_i} \rightarrow \tau^*(\delta_t) > \mathcal{M}^* \rightarrow \tau^*(\delta_t) + B \tag{5.33} \]

$\mathcal{M}_{w_i}$ can be deactivated. The algorithm for isolated word
recognition using HMMs within the TP framework is given in Algorithm 2.

Algorithm 2 Isolated Word Recognition with Beam Search

 1: At time t = 0
 2: All models are activated
 3: Put start token in the entry state of all models
 4: Put null tokens in all other states
 5: $\mathcal{M}^* \rightarrow \tau^*(\delta_t) = \infty$
 6: for t = 1 to T do
 7:   for all models $\mathcal{M}_{w_i}$, $1 \le i \le M$ do
 8:     if $\mathcal{M}_{w_i}$ is active then
 9:       if (5.33) then
10:         Deactivate $\mathcal{M}_{w_i}$
11:       else
12:         step_model($c_t$) with $\mathcal{M}_{w_i}$
13:       end if
14:     end if
15:   end for
16:   compute $\mathcal{M}^* \rightarrow \tau^*(\delta_t)$ according to (5.32)
17: end for
18: Return word as the reading of C where
      word = $\arg\min_{w_i} \left[ \mathcal{M}_{w_i} \rightarrow \tau_N(\delta_T) \right]$ {Only the active $\mathcal{M}_{w_i}$}

At time T the exit state of all active models is inspected. The model with
the lowest cost in its exit state is the one that best matches the character
sequence, and the model's identifier denotes the preferred reading of the
character sequence.

The technique described above, although quite successful at correcting
misspellings, has some obvious shortcomings in most practical applications. In
running text the word boundaries are uncertain and erroneous assumptions in
this respect will result in segmentation errors being treated as spelling
errors. Another problem is that real-word errors will go undetected. Further,
the fact that words are not equally likely in a given context cannot be
overlooked. The way to address these problems is to look at the words in the
context in which they appear. Decisions regarding word boundaries must take
into account the fact that segmentation errors may be present in the input.
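
The isolated word recognition procedure of Algorithms 1 and 2 could be
sketched in Python roughly as follows. The sketch rests on several
assumptions that are not taken from the thesis: `word_model` builds a
hypothetical, untrained left-to-right model that only allows character
substitutions, `step_model` and `recognize` are illustrative names, and states
are numbered from 0.

```python
import math

INF = math.inf


def cost(p):
    """Negative log probability; probability 0 gives infinite cost."""
    return INF if p <= 0.0 else -math.log(p)


def word_model(word, alphabet="abcdefghijklmnopqrstuvwxyz", p_ok=0.9):
    """Hypothetical word HMM: entry state, one state per letter, exit state."""
    n = len(word) + 2
    a = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        a[i][i + 1] = 1.0                       # strictly left-to-right
    p_err = (1.0 - p_ok) / (len(alphabet) - 1)
    b = [{}] + [{ch: (p_ok if ch == x else p_err) for ch in alphabet}
                for x in word] + [{}]
    return a, b


def step_model(model, tokens, c):
    """One token passing step for character c (loop body of Algorithm 1).

    tokens[i] is the cost of the token in state i; INF is the null token.
    """
    a, b = model
    n = len(a)
    new = [INF] * n
    for j in range(1, n - 1):                   # emitting states
        best = min(tokens[i] + cost(a[i][j]) for i in range(n - 1))
        new[j] = best + cost(b[j].get(c, 0.0))
    # The exit transition is hypothesized after every character.
    new[n - 1] = min(new[i] + cost(a[i][n - 1]) for i in range(n - 1))
    return new


def recognize(models, chars, beam=INF):
    """Return the identifier of the cheapest active model (after Algorithm 2)."""
    tokens = {w: [0.0] + [INF] * (len(m[0]) - 1) for w, m in models.items()}
    best_global = INF
    for c in chars:
        for w in list(tokens):
            if min(tokens[w][:-1]) > best_global + beam:    # test (5.33)
                del tokens[w]                               # deactivate model
            else:
                tokens[w] = step_model(models[w], tokens[w], c)
        best_global = min(min(t[:-1]) for t in tokens.values())   # (5.32)
    return min(tokens, key=lambda w: tokens[w][-1])
```

With the default infinite beam no model is ever pruned; a small finite beam
deactivates badly matching models after only a few characters.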

5.4 Connected Text Recognition 

Figure 5.6: The word sequence noisy channel (input $W = w_1, w_2, \ldots, w_n$)

In Connected Text Recognition (CTR) the character sequence can contain any
number of words. The task is to find the most likely word sequence even though
the word boundaries may be obscured (segmentation errors) and the words
themselves are distorted (misspellings). Thus we want to find the word
sequence W that maximizes the quantity P(W|C). This number is impractical to
compute, but again

\[ P(W|C) = \frac{P(C|W)P(W)}{P(C)} \tag{5.34} \]

according to Bayes' rule. The denominator is given, so to find $W^*$, the most
likely word sequence, it suffices to maximize the numerator of (5.34) over the
alternative word sequences.

\[ P(W^*|C) = \max_{W} \left[ P(C|W)P(W) \right] \tag{5.35} \]

The first factor of the right hand side of (5.35) is the channel
characteristics (see Figure 5.6) that models how word sequences are distorted.
The second factor is the prior probability of a word sequence, the language
model. Probabilistic language models usually exploit the local context to
predict the occurrence of a word. Assume, for the sake of this formal account,
that the unspecified language model $\mathcal{G}$ can be used to predict the
probability of a word, $P(w_i|\mathcal{G})$. In this way the prior probability
of a word sequence $W = w_1, \ldots, w_n$ can be computed as

\[ P(W) = \prod_{i=1}^{n} P(w_i|\mathcal{G}) \tag{5.36} \]

Irrespective of the language model used (and even without a language model),
the problem remains to find the most likely segmentation of the input
character stream $C = c_1, \ldots, c_T$, i.e. each word in the word sequence
has to have a portion of the input characters assigned to it. The reason is,
of course, that the channel characteristics must come to bear on the overall
likelihood of the word sequence. For a given word sequence
$W = w_1, \ldots, w_n$ the most likely segmentation can be found by maximizing
over all possible word boundaries, i.e.

\[ P(C|W) = \max_{\substack{1 \le t_i, t'_i \le T \\ t_i \le t'_i}} \prod_{i=1}^{n} P(c_{t_i}^{t'_i}|w_i) \tag{5.37} \]

where $c_{t_i}^{t'_i}$ denotes $c_{t_i}, c_{t_i+1}, \ldots, c_{t'_i-1}, c_{t'_i}$,
the character sequence 'assigned' to word $w_i$. Note that
$t'_{i-1} + 1 = t_i$. Equations (5.37), (5.36) and (5.35) can be used to
formally define the Connected Text Recognition problem in equation (5.38).

\[ P(W^*|C) = \max_{\substack{W \\ 1 \le t_i, t'_i \le T \\ t_i \le t'_i}} \prod_{i=1}^{n} P(c_{t_i}^{t'_i}|w_i) P(w_i|\mathcal{G}) \tag{5.38} \]
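
To make the search in (5.38) concrete, the following Python sketch finds the
cheapest word sequence and segmentation by dynamic programming over word
boundaries. It is only an illustration under strong assumptions: the language
model $\mathcal{G}$ is taken to be a unigram table, and `channel` is a
hypothetical stand-in for the word HMMs that allows character substitutions
only. Neither the function names nor the probabilities come from the thesis.

```python
import math


def cost(p):
    """Negative log probability; probability 0 gives infinite cost."""
    return math.inf if p <= 0.0 else -math.log(p)


def channel(segment, word, p_ok=0.9, alphabet_size=26):
    """Hypothetical P(segment | word): substitution errors only."""
    if len(segment) != len(word):
        return 0.0
    p_err = (1.0 - p_ok) / (alphabet_size - 1)
    p = 1.0
    for c, x in zip(segment, word):
        p *= p_ok if c == x else p_err
    return p


def best_word_sequence(chars, unigram):
    """Minimize the summed cost of (5.38) over words and word boundaries."""
    T = len(chars)
    best = [math.inf] * (T + 1)   # best[t]: cheapest account of chars[:t]
    back = [None] * (T + 1)       # back[t]: (previous boundary, word)
    best[0] = 0.0
    for t in range(1, T + 1):
        for s in range(t):
            for w, p_w in unigram.items():
                total = best[s] + cost(channel(chars[s:t], w)) + cost(p_w)
                if total < best[t]:
                    best[t], back[t] = total, (s, w)
    # Follow the back pointers from the end of the input to its start.
    words, t = [], T
    while t > 0 and back[t] is not None:
        s, w = back[t]
        words.append(w)
        t = s
    return list(reversed(words)), best[T]
```

For example, with the vocabulary {show, me, all} the run-on and misspelled
input 'shwomeall' is read as show, me, all, since every other segmentation
requires more substitutions.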



Finding words in distorted text is quite similar to the speech recognition 
problem. The problem of Connected Speech Recognition (CSR) is to recognize 
and segment out the elements of the continuous speech signal without knowing 
the starting point or the end-point of any of the elements. The elements modeled 
can be phones, subphones or sometimes with a small vocabulary they are words. 

It is interesting to note the differences and similarities between text process- 
ing and speech processing. The primitive input symbol in the text case is, of 
course, the character or the keystroke. In the speech case, without going into 
the hardships of signal processing, the primitive input symbol is the feature 
vector. In most speech recognition systems the feature vector is continuous and 
has around 20 - 40 coefficients and the speech signal is sampled at about 100 
Hz, see for example [Deller et al., 1993]. Whatever statistical model is used to 
model, say, a spoken word as a sequence of, say 50, feature vectors, it is obvious 
that there can never be a perfect match. The word model that is a closest match 
is chosen. Looking at a spoken word as a sequence of feature vectors one can 
thus say that the word is always 'misspelled', where the norm is an imagined 
sequence of feature vectors. 

Looking at speech as a sequence of feature vectors, and text as a sequence 
of characters, a crucial difference is that there is no counterpart to the space 
character in speech. This makes segmentation a primary concern in CSR. In 
CTR the segmentation task is quite easy as long as the space character is prop- 
erly placed but can get quite hard when it is not. The point here is that if we 
agree to carry the error types of text processing over to speech processing, we 
see that speech is virtually littered with misspellings and segmentation errors. 
It is therefore natural to see what the methods used in the difficult speech
recognition task can do in the relatively simple text recognition problem.

The present computational model can straightforwardly be put to use on the
Connected Text Recognition problem using a layering scheme of the different
knowledge sources. The bottom layer, the string pattern matcher, is called the
Orthographic Decoder (OD). The OD consists of a set of word modeling HMMs,
one for each word in the vocabulary, very much like in the previous section.
Each word modeling HMM can assign a probability to the hypothesis that a
certain substring is the word modeled by the network, i.e. the first factor in
(5.38). The subsequent layers, on top of the Orthographic Decoder, are called
the Linguistic Decoder (LD). The LD is the component that hypothesizes word
occurrences in the input using some language model, i.e. the second factor in
(5.38). The information flow between the LD and the OD is such that the LD
predicts that a certain word is present at a particular point in the character
input stream, and the OD reports back the confidence of the match. It is thus
a top-down process.

The Orthographic Decoder and Linguistic Decoder to some extent have com- 
plementary responsibilities in the recognition and error correction process. The 
string matching OD is primarily responsible for nonword errors, both regular 
spelling errors and segmentation errors. The LD reduces the search space and 
decides on 'close calls'. In an utterance like '... in the aboue table',
the OD would probably rule 'above' and 'about' just about equally likely,
but since 'above' is more linguistically plausible in this context, the LD will
(hopefully) rule out 'about'. The predictive power of the LD is of course even
more crucial when dealing with real-word errors since the OD will assign a good 
match to the 'wrong' word. There is always a trade-off situation going on be- 
tween the LD and the OD in the recognition process. If both favor the same 
hypothesis, it will be a clear winner. If not, as in the case with the real-word 
errors, there will be several viable hypotheses. 
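
The interplay between the two decoders can be illustrated with a small worked
example in Python. All probabilities below are invented for the illustration
and `combined_cost` is a hypothetical helper; the point is only that the costs
of the two layers are added, so a tie in the OD is broken by the LD.

```python
import math


def combined_cost(p_od, p_ld):
    """Total cost of a word hypothesis: OD cost plus LD cost."""
    return -math.log(p_od) - math.log(p_ld)


# Hypothetical probabilities for the input 'aboue' in '... in the aboue table'.
od = {"above": 0.0036, "about": 0.0036}   # orthographic match: a tie
ld = {"above": 0.02, "about": 0.001}      # linguistic expectation in context

# Rank the hypotheses by increasing total cost.
ranking = sorted(od, key=lambda w: combined_cost(od[w], ld[w]))
```

Here `ranking` lists 'above' before 'about', because the two hypotheses have
equal OD cost while 'above' has the lower LD cost.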

It was mentioned above that the Linguistic Decoder may consist of several 
layers. In the dialogue application described above, for example, it could be 
useful to model a dialogue in terms of user utterances and system responses. 
An utterance can be given a phrase structure or a bracketing in terms of phrases, 
and phrases can be modeled as word sequences. In this hypothetical scenario 
the LD would consist of three layers, dialogue, utterance and phrase. The 
generalization from one to several LD layers is trivial, so the Token Passing 
account of Connected Text Recognition below will be restricted to a single 
layer Linguistic Decoder that models utterances or sentences in terms of word 
sequences. 

In Connected Text Recognition, each Orthographic Decoder HMM network 
models one word of the vocabulary. This is very much like the idea behind 
Isolated Word Recognition in the previous section. There is, however, one big 
difference. In CTR the input character stream contains word boundaries, usu- 
ally realized by the space character. In finding the best segmentation of the 
input stream the space character is of course crucial, but we do not want to 
make a hard decision regarding the location of the word boundary based solely 
on the fact that there is a space character in a certain position in the input. The 
OD must model word boundary locations, treating the space character like any
other character. There are different ways of doing this. One way would be to
have an HMM network to recognize word boundaries. A single space character,
for example, would then make up the word '␣', modeled by
$\mathcal{M}_{<space>}$ (footnote 6). An utterance like 'show me all cars'
would then be segmented as

==> show | ␣ | me | ␣ | all | ␣ | cars (22)

This is not such a good idea, however, since language modeling would be- 
come unjustifiably expensive. For example, to model utterances with bigrams 
would in fact require trigrams since each pair of proper words is divided by the 
information-poor space-word. A different solution is needed. 

Figure 5.7: The structure of $\mathcal{M}_{\text{␣show}}$ with initial space-state

The space characters present in the utterances are instead treated as parts of
the word models. The solution is quite simple. The OD HMMs that are used in
CTR have an extra initial state added to them that is biased towards
recognizing the space character. The HMMs of the OD are then simply trained
accordingly (see Chapter 6 below). The HMM $\mathcal{M}_{\text{␣show}}$, for
example, will score maximum probability for the character sequence '␣show'.
Using the approach depicted in Figure 5.7, the utterance (22) would be
segmented as

==> show | ␣me | ␣all | ␣cars (23)

The fact that the first word's initial space character is not in the input does not 
present a problem. The two types of segmentation errors are dealt with in the 
following way. By taking the skip transition past the space-state, run-ons can 
be handled. Space characters can (if the network is so trained) be emitted at 
any state, thus splits can be modeled as well. 

The Linguistic Decoder provides the pattern matching Orthographic Decoder
with context and supervises the passing of tokens between word models of the
OD. From a computational view this can be accomplished by connecting the set
of word model HMMs together in a large super-HMM where the connections
between subnetworks determine what words can follow others and with what
probability. This solution can actually be hard-wired into a system, but the
system will be awfully rigid. The Token Passing framework provides a much
more flexible approach. The forwarding of tokens from the exit state of one OD
HMM to the entry state of another is elevated to a higher level of control, the
Linguistic Decoder. The language model encoded by the LD may of course vary,
but it is reasonable to use a probabilistic language model since the OD
operates on probabilities. This makes for a clean LD-OD interface: the LD and
the OD use a common communication protocol. The Connected Text Recognition
experiments reported here employ LDs that encode language models that can
be expressed with HMM networks. This means that the Viterbi algorithm can
also be used in the LD. In the following we will thus assume a single HMM
network in the Linguistic Decoder.

6 Note that what constitutes a word is quite arbitrary. Since the space
character is just another character, one might have, e.g., 'as soon as
possible' as a 'word'. In the computational scheme described here, a word is
merely a sequence of characters that has a network in the OD modeling it.

The Linguistic Decoder HMM network assigns probabilities to word sequences.
The states of the network represent different contexts. (What is meant
by a context of course varies depending on the language model.) Specific word
occurrences are more or less likely to occur in a given context. The transition
distribution of the HMM is thus the probability of going from one context to
another, and the observation symbol distribution is the probability of a word
given the context. The Viterbi algorithm can be used to compute the
probability of the word sequence W and the optimal context sequence $CON^*$
given the LD HMM $\mathcal{M}_{LD}$, i.e. $P(W, CON^*|\mathcal{M}_{LD})$. Note
that $\mathcal{M}_{LD}$ is the language model $\mathcal{G}$ introduced in
equation (5.36), and that $P(W, CON^*|\mathcal{M}_{LD})$ is used as an
approximation ($\approx$) for $P(W|\mathcal{M}_{LD})$ in equation (5.40)
(footnote 7).

\[ P(W, CON^*|\mathcal{M}_{LD}) = \max_{CON} \prod_{i=1}^{n} P(con_i|con_{i-1}) P(w_i|con_i) \tag{5.39} \]

where $con_0$ is a 'dummy-context', or, in the present computational model, the
entry state of $\mathcal{M}_{LD}$.
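
Equation (5.39) can be illustrated with a small Python sketch that computes
the cost of a word sequence under an LD network, minimizing over context
sequences. The function name `ld_cost`, the dictionary representation of the
transition and observation distributions, and the start context label `<s>`
(playing the role of $con_0$) are assumptions made for the example.

```python
import math


def ld_cost(words, trans, emit, start="<s>"):
    """Return -log P(W, CON* | M_LD), cf. (5.39).

    trans[c1][c2] : probability of going from context c1 to context c2
    emit[c][w]    : probability of observing word w in context c
    """
    def cost(p):
        return math.inf if p <= 0.0 else -math.log(p)

    best = {start: 0.0}               # cheapest cost of reaching each context
    for w in words:
        new = {}
        for prev, c_prev in best.items():
            for con, p_t in trans.get(prev, {}).items():
                c = c_prev + cost(p_t) + cost(emit.get(con, {}).get(w, 0.0))
                if c < new.get(con, math.inf):
                    new[con] = c      # keep only the cheapest token per context
        best = new
    return min(best.values(), default=math.inf)
```

With a tag bigram model, for instance, the contexts would be tags, `trans`
the tag bigram probabilities and `emit` the lexical probabilities.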

The quantity $P(w_i|con_i)$ in equation (5.39) can be thought of as the
interface between the two knowledge sources, the coupling between the layers
that enables the recognizer to produce the word sequence that is overall most
likely, with respect to both orthographic evidence and linguistic expectation.
When the LD predicts that the word $w_i$ is present in the input, the OD HMM
$\mathcal{M}_{w_i}$ begins to evaluate that hypothesis. Equation (5.38) can be
made more specific:

\[ P(W^*|C) \approx \max_{\substack{W, CON \\ 1 \le t_i, t'_i \le T \\ t_i \le t'_i}} \prod_{i=1}^{n} \underbrace{P(c_{t_i}^{t'_i}|w_i)}_{OD} \, \underbrace{P(con_i|con_{i-1}) P(w_i|con_i)}_{LD} \tag{5.40} \]

The equation above can be visualized in the Token Passing framework (see
Figure 5.1 in Section 5.2). Tokens are passed within the LD HMM according
to the transition distribution of the network and word models of the OD are
hypothesized according to the observation distribution. As tokens reach the
exit state of the word model, they are propagated back to the context/state
in which they were originally proposed. This upward propagation does not
constitute an extra cost. Note that whereas the workings of the OD HMMs are
time-synchronized, those of the LD are not. One step of the step_model
procedure is executed in the OD HMMs for each character that is read from the
input. The LD simply reacts to tokens that are passed up from beneath, and
this has nothing to do with the time index $t$. This means for example (see
Figure 5.8) that a token in the exit state of some word model at time $t$ can
get passed up to $con_{i-1}$, then passed from there to $con_i$, from $con_i$
to the entry state of $\mathcal{M}_{w_i}$, and time is still $t$. When the OD
starts processing the hypothesis $w_i$, and characters are read from the
input, the time index is, of course, incremented.

Figure 5.8: The LD-OD interface

7 Remember from Section 5.3 that $P(c_{t_i}^{t'_i}, Q^*|w_i)$ is already used
to approximate $P(c_{t_i}^{t'_i}|w_i)$.

The Viterbi algorithm is used to compute the probability of an observation 
sequence and the optimal state sequence. The state sequence is not of great 
interest when computing the probability of a character sequence given a word
model, $P(c_{t_i}^{t'_i}, Q^*|\mathcal{M}_{w_i})$, the quantity computed by the
OD HMM networks. The states of an OD network represent character positions and
this information is not really useful; the quantity is used more like an
approximation for $P(c_{t_i}^{t'_i}|\mathcal{M}_{w_i})$.
The same reasoning can be applied to the LD (see Section 5.2), but whether the 
context sequence is important or not depends on what a context represents in the 
language model encoded by the LD network. Irrespective of whether the context 
sequence is informative or not, the word sequence certainly is. Retrieving the 
word sequence is our main objective. Since each token arriving in an LD network 
state from below represents a word matched with the input, this event has to 
be recorded so that backtracking is enabled. The data structure used to record 
the word hypothesis is the Word Link Record (see Section 5.2). 

The Word Link Records (WLR) are stored in a linked list structure where
each path through the structure represents a word sequence hypothesis
(footnote 8). Each token, besides the cost, keeps a path identifier (pointer)
to the last WLR in the word sequence hypothesis. That WLR has a pointer to its
predecessor and so on. The token itself represents the head of the path. The
WLR is created as a token in the exit state of an OD network is propagated
back to the LD state/context that hypothesized the word, and the new WLR is
incorporated into the list structure. The WLR contains the cost of the path
(up to the point where it was created), the path identifier (inherited from
the token), the time index (character position) at which it was created and
the model identifier of the OD HMM that the token exited from (footnote 9).
The process is visualized in Figure 5.9.

Figure 5.9: A Word Link Record is created as a token is passed upwards



The 'start-WLR' with * as model identifier is the root of the linked list
structure. At the start of the recognition process the start token in the LD
points to this WLR. Note that there might be several tokens pointing to the
same WLR, along with other WLRs. The creation of new WLRs is called
record-decisions, Algorithm 3, and it is of course crucial in Connected Text
Recognition. The four fields of the WLR in Algorithm 3 are denoted $\delta$,
$\uparrow_{wlr}$, time and word. The cost and path identifier fields of the
token are denoted $\delta_t$ (as usual) and $\uparrow_{wlr}$.

8 The LD layer uses the linked list structure to keep track of what is going
on in the layer beneath. In the general case, where there might be several LD
layers, each layer would need its own WLR list structure.

9 It is possible to store additional information in the WLRs. The context in
which the word occurrence was hypothesized, for example, might be useful in
subsequent processing of the text.

Algorithm 3 record-decisions

for all states i < N {of the Linguistic Decoder} do
  if i holds a token $\tau_i$ then
    create a new WLR wlr
    with wlr do
      wlr($\delta$) = $\tau_i(\delta_t)$
      wlr($\uparrow_{wlr}$) = $\tau_i(\uparrow_{wlr})$
      wlr(time) = t
      wlr(word) = $w_k$ {$\mathcal{M}_{w_k}$ is the OD HMM that propagated $\tau_i$}
    end with
    $\tau_i(\uparrow_{wlr})$ = wlr
  end if
end for

The reader should not be bothered by the fact that the algorithms here and
elsewhere in this thesis do not make sense down to the last detail. For example,
how can a token know which state of the LD hypothesized it, so that it can
get propagated back up to that state once it reaches the exit state of the OD
network? There is of course a simple technical solution to this sort of
problem, and issues of this type are generally suppressed in the algorithmic
outlines. The purpose of the algorithms is to convey the basic idea.

The Token Passing algorithm is now set to recognize word sequences instead
of isolated words. The LD HMM used in the superficial algorithmic presentation
below is similar to the OD HMM in Figure 5.7 except that it is not
limited to left-to-right transitions (and of course, the observation symbols are
not characters but words). The Viterbi algorithm is used both in the LD and
the OD. The Beam Search heuristic has a slightly different effect in Connected
Text Recognition compared to Isolated Word Recognition. If a word model
gets deactivated in IWR it stays deactivated for the duration of the recognition
process, whereas in CTR a word model may be reactivated at any time (cf. line
10 in Algorithm 4). The step_model procedure, Algorithm 1, is reused here with
only minor changes. The token put in the entry state of the OD HMM does not
have zero cost since it has been subject to prior cost accumulation. Recall:
the start token has cost $-\log 1 = 0$ and the null token has cost
$-\log 0 = \infty$.

At time T the most likely word sequence can be established by following
the path identifier chain of the token in the exit state
($\mathcal{M}_{LD} \rightarrow \tau_N$) back to the start-WLR, which indicates
the start of the sequence.
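
The Word Link Record bookkeeping and the final backtracking step could be
sketched in Python as below. This is an illustrative sketch only: the class
and function names (`WLR`, `Token`, `record_decision`, `backtrack`) are
assumptions, and the record holds just the four fields described in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WLR:
    cost: float               # cost of the path up to the point of creation
    prev: Optional["WLR"]     # path identifier inherited from the token
    time: int                 # character position at which it was created
    word: str                 # model identifier of the OD HMM that was exited


@dataclass
class Token:
    cost: float
    wlr: Optional[WLR]        # last WLR of this token's word sequence


def record_decision(token: Token, time: int, word: str) -> None:
    """Create a WLR for a token that was passed up from an OD exit state."""
    token.wlr = WLR(token.cost, token.wlr, time, word)


def backtrack(token: Token) -> List[str]:
    """Follow the WLR chain back to the start-WLR (model identifier '*')."""
    words, wlr = [], token.wlr
    while wlr is not None and wlr.word != "*":
        words.append(wlr.word)
        wlr = wlr.prev
    return words[::-1]
```

Since every record only points backwards, several tokens can share a common
prefix of the list structure without copying it.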

The algorithm is quite easily generalized to perform N-best search. It suffices
to let each state hold N tokens instead of just one. Lines 27 and 33 of
Algorithm 4 should be changed to: 'Find the N tokens with min cost and discard
the rest'.

Algorithm 4

 1: for t = 0 do
 2:   Create the start-WLR
 3:   Put start token in the entry state of the LD
      and let it point to the start-WLR
 4:   Put null tokens in all other states of the LD
 5:   Deactivate all models of the OD
 6:   $\mathcal{M}^* \rightarrow \tau^*(\delta_t) = \infty$
 7: end for
 8: for t = 1 to T do
 9:   for all states i < N in $\mathcal{M}_{LD}$ with a non-null token do
10:     Pass a copy of the token $\tau_i$ to the entry state of all $\mathcal{M}_{w_k}$ observable in state j:
          $\mathcal{M}_{w_k} \rightarrow \tau_1(\delta_t) = \tau_i(\delta_t) + (-\log a_{ij}) + (-\log b_j(w_k))$ {Reactivation}
11:   end for
12:   Put null tokens in all states in $\mathcal{M}_{LD}$
13:   for all models $\mathcal{M}_{w_k}$, $1 \le k \le M$ do
14:     if $\mathcal{M}_{w_k}$ is active then
15:       if (5.33) then
16:         Deactivate $\mathcal{M}_{w_k}$
17:       else
18:         step_model($c_t$) with $\mathcal{M}_{w_k}$
19:       end if
20:     end if
21:   end for
22:   compute $\mathcal{M}^* \rightarrow \tau^*(\delta_t)$ according to (5.32)
23:   if the token in the exit state of $\mathcal{M}_{w_k}$ is non-null then
24:     Propagate the token up to the LD state that hypothesized it
25:   end if
26:   for all states i < N in $\mathcal{M}_{LD}$ do
27:     Find the token with min cost and discard the rest
28:   end for
29:   record-decisions
30:   for all states i < N in $\mathcal{M}_{LD}$ connected to state N do
31:     Pass a copy of the token $\tau_i$ to state N:
          $\tau_N(\delta_t) = \tau_i(\delta_t) + (-\log a_{iN})$
32:   end for
33:   In state N of $\mathcal{M}_{LD}$: Find the token with min cost and discard the rest
34: end for
35: Backtrack $\mathcal{M}_{LD} \rightarrow \tau_N(\uparrow_{wlr})$

Algorithm 4 returns the best word sequence, i.e. it performs a 1-best search.
Under certain circumstances, however, Algorithm 4 is suboptimal in the sense
that it cannot guarantee that the overall best word sequence is returned. If
the algorithm is run in 1-best mode, i.e. each state of the OD HMMs can hold
only one token, and the same word model can be hypothesized from more than one
state, it can happen that the token that would score the overall lowest cost
if it were allowed to proceed gets pruned because there is another token with
a lower cost for the left context that is preferred in the entry state of the
OD HMM. The part-of-speech bigram language model is an example of a language
model with this property (the words are ambiguous with respect to their
part-of-speech). Young et al. [1989] get around this problem by having
multiple instances of the same word model, one instance for each context in
which it is observable. A variation of the same scheme is used here. Instead
of having multiple network instances, each state can hold more than one token,
i.e. N-best search is used to guarantee that the overall best sequence is
obtained. (See the following chapter for further comments on this subject.)

Note that the word sequence returned is really a sequence of model
identifiers. Supposedly the model identifier of a model is exactly the
character sequence that the model models, e.g. the model identifier of
$\mathcal{M}_{\text{␣show}}$ is '␣show', but that need not necessarily be the
case. For some word-forms it is unrealistic to have one HMM network for each
word-form (character sequence). The natural numbers are one such 'word
group'. A solution to this problem would be to have a single network that
recognizes a number of word-forms, e.g. dates, social security numbers and
possibly even proper nouns. The network $\mathcal{M}_{<date>}$ with model
identifier '<date>' would then recognize character sequences like '4 of
July', and the input 'On the 4 July we went to New York' could come out as
'On the <date> we went to <city>'. Since the WLRs contain the word boundary
positions, it is possible to extract the character sequence recognized as
'<date>' from the character input stream.
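
Recovering the recognized character sequences from the word boundary
positions could look as follows in Python. The sketch assumes that the
backtracked Word Link Records have been flattened into a list of
(time index, model identifier) pairs in input order; the function name
`recognized_spans` is hypothetical.

```python
def recognized_spans(chars, records):
    """Pair each model identifier with the characters it accounted for.

    records: list of (end_position, model_identifier) in input order, where
    end_position is the time index stored in the Word Link Record.
    """
    spans, start = [], 0
    for end, word in records:
        spans.append((word, chars[start:end]))   # characters start .. end-1
        start = end
    return spans
```

Note that, because the space character belongs to the word model that
follows it, each extracted span except the first begins with the space.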

The approach to Connected Text Recognition presented here has the po- 
tential to deal with all the lexical errors discussed in this thesis. It has the 
potential to handle misspellings, run-ons and splits, single and multiple char- 
acter errors and nonword and real-word errors. Having the potential to solve a 
problem is, however, not the same thing as actually solving it. This of course ul- 
timately depends on the accuracy with which the Orthographic Decoder models 
the noisy channel and the Linguistic Decoder the language. The performance 
of this approach has to be experimentally evaluated. 



Chapter 6 

Experimental Evaluation 



To test our ideas of the layered HMM approach in the Token Passing framework 
we have developed a system, CTR, to perform Connected Text Recognition. The 
system is based on the algorithms presented in the previous chapter and is thus 
also restricted to two layers. The Linguistic Decoder is represented in one 
(the topmost) layer by a single HMM network and the Orthographic Decoder 
is realized by the set of word modeling HMM networks in the bottom layer. 
The Beam Search threshold defines the level of trade-off between accuracy and 
computational efficiency (speed). In the experiments reported below we have 
opted for accuracy at the expense of speed. We have used an infinitely wide 
Beam so that no hypotheses are ever pruned. 

Much of the work presented in this thesis is based on the findings in the 
dialogue corpus profiled in Chapter 3. The dialogue corpus application was 
also the focal point during the development of the techniques. The CARS part 
of the dialogue corpus was the first error corpus that was given to CTR for 
testing (Section 6.1). Later we came to the conclusion that we needed to test 
on a larger corpus as well, and this materialized in the SECRETARY experiment 
(Section 6.2). The SECRETARY experiment is a completely different application
compared to the dialogue scenario; it concerns transcription typing of a
software manual.



6.1 CARS 

The CTR experiments reported here concern the CARS corpus. The corpus in- 
cludes 20 dialogues. It contains in all 369 utterances, 3,139 word tokens and 584 
word types. There are 92 lexical errors distributed over 71 utterances. There 
are 62 misspellings, 17 run-ons and 13 splits 1 . 



¹ The figures on error frequencies and results presented in this thesis are not in complete
agreement with figures presented in previous publications [Ingels, 1996b, Ingels, 1996a]. The
reason is that different definitions have been used for multiple segmentation errors. A string
like 'toyotapeugeotvolkswagen' is treated as one multiple run-on here, whereas in the other
publications this string was regarded as two run-ons.

The intention with the CARS experiment is first and foremost to get a handle
on the overall error correcting performance of the CTR system. We are also 
interested in seeing what impact different language models will have on the 
system. With sparse data, as in the present case, it is not absolutely certain 
that a Linguistic Decoder implementing a language model will have a strong 
positive effect. Experiments have been conducted on three different, rather 
weak, language models, a Unigram language model and two tag Bigram language 
models. The difference between the two tag Bigram language models is that 
they use different tag-sets. One uses a small domain-oriented tag-set while the 
other employs a Part-Of-Speech (POS) tag-set. From the point of view of the 
application it is of interest to note any differences between the application- 
close domain oriented tag-set and the linguistically oriented tag-set. We also 
have a Baseline to which the results of these experiments can be compared. The 
Baseline experiment involves no linguistic constraints so the correction of lexical 
errors is performed by the Orthographic Decoder alone. 

The 20 dialogues were randomly divided into five parts of four dialogues 
each. In the experiments, 16 dialogues (four parts) were used to obtain the 
language model and then the model was tested on the remaining four dialogues 
(one part). The partitionings were rotated so that each language model was 
tested on all of the five parts. The same Orthographic Decoder was used in all 
the experiments. 
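
The rotation scheme described above can be sketched as follows. This is a
minimal illustration rather than the code used in the thesis: the function
name `rotate_partitions`, the fixed random seed and the use of integer IDs
for the dialogues are all assumptions.

```python
import random


def rotate_partitions(dialogues, n_parts=5, seed=0):
    """Yield (training, test) splits in which each part is held out once."""
    shuffled = list(dialogues)
    random.Random(seed).shuffle(shuffled)  # random division into parts
    part_size = len(shuffled) // n_parts
    parts = [shuffled[i * part_size:(i + 1) * part_size] for i in range(n_parts)]
    for held_out in range(n_parts):
        test = parts[held_out]
        train = [d for i, part in enumerate(parts) if i != held_out for d in part]
        yield train, test


# 20 dialogues give five splits of 16 training and 4 test dialogues each.
splits = list(rotate_partitions(range(20)))
```

Every dialogue ends up in exactly one test part, so the combined test
results cover the whole corpus.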

6.1.1 The Orthographic Decoder 

The Orthographic Decoder contains 584 word modeling HMMs, one for each 
word type in the corpus. The structure of the OD HMMs can be seen in
Figure 5.7. Ideally each HMM should be trained on typical errors occurring in
Swedish text. Unfortunately there is no such error corpus available and we can 
certainly not train the HMMs on the errors occurring in the corpus. We must 
find a way to generate an error corpus so that the OD HMMs can be trained 
and used for other purposes as well, not just to identify the particular errors in 
this corpus. 

Inspired by the basic error types one can construct four error generating
functions. Given a word, these functions will produce a set of corrupted forms of
the input word. Given the word '␣show' for example (where ␣ denotes the space
character), the deletion function will produce:
{'show', '␣how', '␣sow', '␣shw', '␣sho'}. The deletion operator
is applied to each character position in the word. Although these error types
apply to the space character as well as to any other character, we have an extra
error type dealing only with the space character, which is called white-space
insertion. There is also an error type called double stroke. The error types
insertion and substitution raise the question as to what to insert and what to
substitute for, respectively. One hypothesis is that keyboard neighbors are likely
to take part in, for example, substitutions. The neighbors of 'o' are 'i' and
'p' (the neighbor relation is limited to immediate left and right neighbors), so
if substitutions are applied, the error corpus for $M_{show}$ will contain
amongst others '␣shiw' and '␣shpw'. The list of error-generating operators
that have been considered for training of the OD HMMs is thus:

• deletion (e.g. '␣shw')

• insertion (e.g. '␣shpow')

• substitution (e.g. '␣shpw')

• transposition (e.g. '␣sohw')

• white-space insertion (e.g. '␣sh␣ow')

• double stroke (e.g. '␣shoow')
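
The six operators listed above can be sketched in a few lines. This is an
illustration only: the QWERTY letter rows, the function names, and the choice
to insert a neighbor key just before the intended key are assumptions, since
the thesis does not spell out these details. A plain space stands in for ␣.

```python
# Assumed QWERTY letter rows; the actual keyboard layout is not specified.
KEY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def neighbors(ch):
    """Immediate left and right neighbors of ch on its keyboard row."""
    for row in KEY_ROWS:
        i = row.find(ch)
        if i != -1:
            return [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    return []


def deletions(word):
    """Drop the character at each position in turn."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


def substitutions(word):
    """Replace each character by each of its keyboard neighbors."""
    return [word[:i] + n + word[i + 1:]
            for i, ch in enumerate(word) for n in neighbors(ch)]


def insertions(word):
    """Assumption: a neighbor key is struck just before the intended key."""
    return [word[:i] + n + word[i:]
            for i, ch in enumerate(word) for n in neighbors(ch)]


def transpositions(word):
    """Swap each pair of adjacent characters."""
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]


def whitespace_insertions(word):
    """Insert a space between two adjacent non-space characters."""
    return [word[:i] + " " + word[i:] for i in range(1, len(word))
            if word[i - 1] != " " and word[i] != " "]


def double_strokes(word):
    """Repeat each character once."""
    return [word[:i] + ch + word[i:] for i, ch in enumerate(word)]
```

Applied to `" show"`, `deletions` reproduces the set given in the text, and
each of the other functions produces the corresponding example from the list
among its outputs.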

Note that the basic error-generating functions above will produce mostly per- 
formance related errors, i.e. they are likely to be generated by a human only 
by mistake. The error corpora generated for the various OD HMMs are conse- 
quently quite poor with respect to knowledge of spelling errors people produce 
for other reasons. Phonetic resemblance and other cognitively related difficulties 
are not included in the corpora. The only 'external' knowledge included in the 
training corpora is the layout of the keyboard. 

It is, of course, often the case that a corruption generated by one of the 
above error functions turns out as another legal word in the vocabulary. If 
deletion is applied to 'them' for example, 'the' will be part of 'them''s error 
corpus. A simple little program filter was devised to remove all such 'real- 
word' corruptions. If the corpora are filtered, the effect will be that real-word 
errors are harder to recognize, but on the other hand CTR will be less likely to 
change words that are properly spelled. 
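
The filtering step can be illustrated with the 'them' example. The sketch
below assumes a toy vocabulary and hypothetical function names; it only shows
the idea of dropping corruptions that coincide with legal words.

```python
def deletions(word):
    """Drop the character at each position in turn."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


def filter_real_words(corruptions, vocabulary):
    """Remove corruptions that happen to be legal words in the vocabulary."""
    return [c for c in corruptions if c not in vocabulary]


vocabulary = {"the", "them", "then"}  # toy vocabulary, for illustration only
raw = deletions("them")               # includes the real word 'the'
filtered = filter_real_words(raw, vocabulary)
```

After filtering, 'the' no longer appears in the error corpus for 'them', so
the model for 'them' is not trained to absorb a correctly spelled 'the'.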

The word models of the OD have been trained on their respective error
corpora with the Baum-Welch reestimation algorithm. A maximum likelihood
estimation procedure like the Baum-Welch algorithm will assign zero
probabilities to all unseen events. Since unexpected events such as misspellings not
included in the corpus will most likely appear, the parameters of the model must
be smoothed. Smoothing is the term generally used for making the distributions
more uniform. Very low probabilities are adjusted upwards and high
probabilities are adjusted downwards. The models used in the experiments reported here
have all been smoothed with one of the simplest smoothing schemes, the
additive smoothing scheme [Levinson et al., 1983] (p. 1053). After the model has
been trained, a small number $\epsilon > 0$ is assigned to all parameters corresponding
to unseen events and the other parameters are adjusted downwards accordingly.
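
A minimal sketch of this kind of smoothing is given below. It assumes that
the remaining probability mass is shared among the seen events in proportion
to their trained values; the exact redistribution rule in
[Levinson et al., 1983] may differ, and the function name is hypothetical.

```python
def additive_smooth(probs, epsilon):
    """Give every zero-probability event the value epsilon and rescale the rest.

    probs maps events to probabilities that sum to one.
    """
    unseen = [event for event, p in probs.items() if p == 0.0]
    seen_mass = 1.0 - epsilon * len(unseen)  # mass left for the seen events
    if seen_mass <= 0.0:
        raise ValueError("epsilon is too large for this many unseen events")
    total_seen = sum(p for p in probs.values() if p > 0.0)
    return {event: (epsilon if p == 0.0 else p * seen_mass / total_seen)
            for event, p in probs.items()}


smoothed = additive_smooth({"a": 0.75, "b": 0.25, "c": 0.0}, epsilon=1e-4)
```

The smoothed distribution still sums to one, the unseen event 'c' receives
exactly epsilon, and the ratio between the seen events is preserved.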

Because no Swedish corpus of actually occurring spelling
and segmentation errors is available, the process of finding a reasonable training
setup for the OD HMMs is very much a matter of trial and error. Prior to the
first experiments on CARS the Baseline system configuration (see Figure 6.1)
was used to try out different training corpora and smoothing parameters for
the OD HMMs. The Baseline system configuration runs without a Linguistic
Decoder. At each time t the best token, out of the M possible tokens that can
be propagated out of the exit states of the M OD HMMs, corresponds to the
best word hypothesis at time t. This token generates a WLR corresponding to
the word hypothesis and copies of the token are inserted into the entry state of
all M word models, i.e. the tokens just circulate through the LD without any
cost being added.



Figure 6.1: The Baseline experiment - CTR setup (diagram not reproduced; it
shows the LD layer above the OD layer)



In the experiments reported in Section 6.1.5, the error corpora were
generated with the error functions deletion, substitution, and white-space insertion.
Apart from this general strategy, some special words need specialized corpora.
These words include single character 'words' such as ',', '?', '.' and so on.
There are seven such words in the vocabulary and the training material is a small
set of hand-made corruptions that only involve the space character. There are
51 numbers in the vocabulary. There are car prices, figures for fuel
consumption, grades and so on. Although there are more clever ways to handle things
like this (cf. Section 5.4 page 64), the numbers all have their own individual
HMM in the OD modeling it. These special words have corpora generated with
only the white-space insertion error function. The error corpora thus generated
were then filtered for real-word errors. The OD HMMs were trained with the
Baum-Welch reestimation algorithm. After training, each HMM had its
observation symbol distribution smoothed with the additive smoothing scheme with
$\epsilon_{obs} = 10^{-4}$.

Note that we are evading the unknown word problem. Even if a word type 
is unseen in the training corpus of an experiment, the OD will still contain the 
model corresponding to the unseen word.

6.1.2 The Unigram Language Model

The Unigram language model:

$$P(w_1, \ldots, w_T) = \prod_{i=1}^{T} P(w_i) \qquad (6.1)$$

The language model's parameters are extracted from the training corpus of the
five partitionings.

$$P(w_i) = \frac{\mathrm{Count}(w_i)}{N}$$

where N is the number of word tokens in the training corpus. For each of 
the five partitionings there will be a fair amount of unknown words in the test 
corpus that have to be smoothed. Recall that one fifth of the entire corpus is 
held out for testing in each partition. It should be noted that the observables of 
the Linguistic Decoder that are smoothed are exactly the words that are unseen 
in the training material but are members of the vocabulary, i.e. word models 
with non-zero probability after smoothing are the words of the vocabulary and 
no other. 

The Linguistic Decoder realizing the Unigram model is a single state HMM 
(three states including the entry and exit states). The parameters of the Uni- 
gram make up the observation symbol distribution of the LD HMM. 

The results on the Unigram Linguistic Decoder reported in Section 6.1.5 refer
to the combined results from the five experiments with the five partitionings
where the LD has been smoothed with the additive smoothing scheme with
$\epsilon_{obs} = 10^{-4}$.
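
As a rough illustration of equation (6.1) and the estimate
$P(w_i) = \mathrm{Count}(w_i)/N$, the sketch below builds a smoothed Unigram
model over a fixed vocabulary and scores a sentence. The function names, the
toy data and the proportional rescaling of the seen words are assumptions made
for the example.

```python
import math
from collections import Counter


def unigram_model(training_tokens, vocabulary, epsilon=1e-4):
    """MLE unigram P(w) = Count(w) / N, with epsilon for unseen vocabulary words.

    Assumes every training token is a member of the vocabulary.
    """
    counts = Counter(training_tokens)
    n = len(training_tokens)
    unseen = [w for w in vocabulary if counts[w] == 0]
    seen_mass = 1.0 - epsilon * len(unseen)  # mass left for the seen words
    return {w: epsilon if counts[w] == 0 else seen_mass * counts[w] / n
            for w in vocabulary}


def log_prob(sentence, model):
    """Logarithm of equation (6.1): the sum of log P(w_i) over the words."""
    return sum(math.log(model[w]) for w in sentence)


vocabulary = {"show", "all", "cars", "costs"}
model = unigram_model(["show", "all", "cars", "show"], vocabulary)
```

Only words in the vocabulary receive a probability, which mirrors the remark
that the smoothed observables are exactly the vocabulary words unseen in the
training material.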

6.1.3 The Domain-Tag Bigram Language Model

In the domain-tag Bigram language model there are 19 tags. The words of 
the corpus are grouped into classes that are semantically- or domain-oriented. 
Examples of classes and class members are: 

• Object Head (OH), e.g. '␣saab␣900', '␣all'

• Aspect Head (AH), e.g. '␣costs', '␣acceleration'

• Communicative Head (CH), e.g. '␣show', '␣example'

The complete tag-set is listed in Appendix A.1.

If $tag_1^{T+1}$ denotes a sequence of T tags assigned to a sequence of T words
(plus the dummy tag $tag_{T+1}$ corresponding to the nonexistent word $w_{T+1}$), the
tag Bigram language model looks like:

$$P(w_1, \ldots, w_T) = \sum_{\text{all } tag_1^{T+1}} \prod_{i=1}^{T} P(w_i \mid tag_i) \, P(tag_{i+1} \mid tag_i) \qquad (6.2)$$




The language model's parameters are extracted from the tagged training corpus
of the five partitionings:

$$P(tag_{i+1} \mid tag_i) = \frac{\mathrm{Count}(tag_i, tag_{i+1})}{\mathrm{Count}(tag_i)}$$

$$P(w_i \mid tag_i) = \frac{\mathrm{Count}(tag_i, w_i)}{\mathrm{Count}(tag_i)}$$



The tag Bigram can be straightforwardly implemented as our Linguistic
Decoder HMM, see for example Cutting et al. [1992]. The second factor on the
right-hand side of equation (6.2) is the transition distribution of the LD and the
first factor is the observation distribution. The observables of the LD HMM are
the words of the vocabulary, or in other words, the word modeling HMMs of
the OD. The CTR setup is shown in Figure 5.8 where the contexts correspond
to the tags of the language model.
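
The relative-frequency estimates for the transition and observation
distributions can be sketched as below. The function name, the `END` marker
standing in for the dummy tag after the last word, and the toy corpus are
assumptions for the example; smoothing is left out.

```python
from collections import Counter

END = "<end>"  # hypothetical name for the dummy tag following the last word


def estimate_tag_bigram(tagged_sentences):
    """Estimate P(next_tag | tag) and P(word | tag) by relative frequency.

    tagged_sentences is a list of sentences, each a list of (word, tag) pairs.
    """
    tag_count = Counter()
    trans_count = Counter()
    emit_count = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence] + [END]
        for (word, tag), next_tag in zip(sentence, tags[1:]):
            tag_count[tag] += 1
            trans_count[(tag, next_tag)] += 1
            emit_count[(tag, word)] += 1
    transition = {k: c / tag_count[k[0]] for k, c in trans_count.items()}
    observation = {k: c / tag_count[k[0]] for k, c in emit_count.items()}
    return transition, observation


corpus = [
    [("show", "CH"), ("saab", "OH")],
    [("show", "CH"), ("costs", "AH")],
]
transition, observation = estimate_tag_bigram(corpus)
```

In this toy corpus the tag CH is followed by OH once and by AH once, so each
transition gets probability 0.5, while 'show' is the only word observed for CH.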

The tag Bigram language model models local contextual dependencies. These
dependencies are weak. There is a great deal of uncertainty as to what the next
word might be, judging from the tag assigned to the present word. The
uncertainty is emphasized by the relatively small tag-set: each class has a relatively
large number of potential realizations. Still, the test corpus will contain both
unseen tag-to-tag transitions and, of course, previously unseen words. This
means that both the transition distribution and the observation distribution of
the LD have to be smoothed. In the case of the Unigram, the single state of the
LD has the entire vocabulary as observables. The states of the tag Bigram LD
each have a subset of the vocabulary as possible observables. When the
observation distribution of the tag Bigram LD is smoothed, it is only the previously
unseen observables of this subset that are assigned a non-zero probability; the
remainder stay zero.

It should be noted that we are not using the Baum-Welch reestimation
algorithm here. CTR can, if so instructed, return the tag sequence along with
the normalized utterance just like an ordinary POS tagger. It was discussed
above that whether or not this is desirable depends on what the context
encodes. Especially with the domain oriented tags it can be useful to have the
utterance tagged since it will reduce the interpretation step, from input query to
SQL-query, quite substantially. For the purpose of tagging, the language model
extracted from a tagged corpus will outperform the language model induced
from an untagged corpus with a maximum likelihood estimation procedure
[Elworthy, 1994]. For the purpose of predicting the next word however, this need
not be the case. In the CARS experiment we utilize the tagged corpus, whereas in
the SECRETARY experiments presented in the following section we will contrast
the two approaches.

The results on the domain-tag Bigram Linguistic Decoder reported in
Section 6.1.5 refer to the combined results from the five experiments with the five
partitionings where the LD has been smoothed with the additive smoothing
scheme with $\epsilon_{trans} = 10^{-3}$ and $\epsilon_{obs} = 10^{-3}$. The most ambiguous word in the
language model is three ways ambiguous, so CTR was run under 3-best search
to guarantee optimal performance.

6.1.4 The POS Bigram Language Model 

The POS Bigram language model has 31 tags. The tag-set originates from
the SUC corpus (Stockholm-Umeå Corpus [Källgren, 1990]). The tags used in
the SUC corpus are traditional Part-Of-Speech tags with associated morphological
features. We have made slight modifications to the original set of SUC-tags to 
obtain a set of atomic tags with different syntactic distributions. Examples of 
tags and tag members are: 



• Proper noun (PM), e.g. '␣saab␣900'

• Determiner (DT), e.g. '␣all'

• Verb form finite (VBF), e.g. '␣costs'

• Noun (NN), e.g. '␣acceleration', '␣example'

• Verb form imperative (VBP), e.g. '␣show'

The complete tag-set is listed in Appendix A.2.

The POS Bigram parameters are extracted from the tagged training corpus 
in the same way as was done with domain-tag Bigram. Also with POS Bigram 
both the state transition distribution and the observation symbol distribution 
are smoothed with the additive smoothing scheme. 

The results on the POS Bigram Linguistic Decoder reported in Section 6.1.5
refer to the combined results from the five experiments with the five partitionings
where the LD has been smoothed with the additive smoothing scheme with
$\epsilon_{trans} = 10^{-3}$ and $\epsilon_{obs} = 10^{-3}$. The most ambiguous word in the language
model is three ways ambiguous, so CTR was run under 3-best search to guarantee
optimal performance.

6.1.5 Results 

When an experiment is conducted, CTR is run on the corpus in batch mode,
i.e. utterances are processed from an input file and output to an output file.
This creates pairs of utterances. Thus, the result of an experiment is a set
of pairs: (original utterance , normalized utterance). An experiment is
evaluated by comparing the pairs resulting from the experiment to pairs in a
result key. The key is a hand-made set of pairs where the first element (the
original utterance) contains at least one lexical error and the second element is
the appropriate correction of that utterance. This set is called A. The outcome
of an experiment consists of the pairs produced in the experiment where the second
element is not identical to the first one, and the pairs where the first element
is identical to the first element in one of the pairs in A. In other words: the
outcome of an experiment consists of the pairs produced in the experiment where CTR
has changed the input, and the pairs where it should have changed the input.
This set is called C. The pairs in the outcome that are also in the key belong
to the set B, i.e. B = A ∩ C. The outcome of an experiment can now be rated
with respect to the performance measures recall and precision.



$$\mathrm{recall} = \frac{|B|}{|A|} \times 100$$

$$\mathrm{precision} = \frac{|B|}{|C|} \times 100$$



There is one important point that needs to be emphasized regarding this 
style of evaluation. Since the outcome of an experiment does not only include 
the utterances that have actually been changed, but also those that should have 
been changed, this means that C will always contain at least as many pairs as 
A. The reason for this somewhat unusual evaluation scheme is of course that we 
want to capture the performance of the system regarding real-word errors. The
implication of this evaluation style is that precision can never be higher than
recall. If the two metrics are the same, this means that no unfounded changes 
have been made to the input. 
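
The evaluation scheme can be sketched with ordinary set operations. The
function name and the toy utterances below are made up for illustration; the
sketch assumes that each original utterance occurs only once.

```python
def evaluate(key, produced):
    """Return (recall, precision) in percent.

    key      -- set A of (original, correct) pairs, each with a lexical error
    produced -- all (original, normalized) pairs output by the system
    """
    erroneous_inputs = {original for original, _ in key}
    # C: pairs where the input was changed, or where it should have been changed
    outcome = {(original, normalized) for original, normalized in produced
               if original != normalized or original in erroneous_inputs}
    hits = key & outcome  # B = A intersected with C
    recall = 100.0 * len(hits) / len(key)
    precision = 100.0 * len(hits) / len(outcome)
    return recall, precision


key = {("shw cars", "show cars"), ("rust protetion", "rust protection")}
produced = [
    ("shw cars", "show cars"),            # corrected properly
    ("rust protetion", "rust protetion"), # error left untouched
    ("all models", "all models"),         # well-formed, left alone
    ("price", "prize"),                   # unfounded change
]
recall, precision = evaluate(key, produced)
```

Because every pair in A has a counterpart in C, C is at least as large as A,
and precision cannot exceed recall. In the toy data one of two errors is
repaired and one unfounded change is made.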

An example of a pair in A: (rust protetion forthese , rust protection
for these). The first element of the pair contains two errors and we would like
to extend the performance measure to account for individual errors, not just
whole utterances. From the outcome of the experiment we can extract the
counterparts for A, B and C that apply to the respective error categories. We
have $A^m$, $B^m$ and $C^m$ for misspellings, $A^r$, $B^r$ and $C^r$ for run-ons and
$A^s$, $B^s$ and $C^s$ for splits. We are also interested in the total number
of individual errors so the key $A^{tot} = A^m \cup A^r \cup A^s$ is added to the list of
keys. The example pair above that was a member of A also adds (protetion
, protection) to $A^m$ and $A^{tot}$, and (forthese , for these) adds to $A^r$ and
$A^{tot}$. The five keys provide the five performance categories in the tables below.



Experiment           Performance categories   Recall   Precision
Baseline             Utterances               72%      72%
                     Misspellings             74%      74%
                     Run-ons                  100%     100%
                     Splits                   85%      65%
                     Total                    80%      77%

Table 6.1: Baseline experiment



In the Baseline experiment (Table 6.1) there is an 80% total recall. The drop
in precision is quite small, which is not surprising since there is no language model
to 'disturb' the Orthographic Decoder. The 80% → 77% drop is entirely due
to the bad splits precision. In a handful of places in the corpus there are double
space characters in between words. Since the LD does not add a cost to the
forming of words, the superfluous space will be changed to a single character
word such as ','. The double space in the input utterance does not constitute
an error by our definition, so an error is introduced and the error is classified in
terms of the transformation from input to output utterance, in this case a split.
For example: '... models␣␣and ...' → '... models,␣and ...'.

When the LD is furnished with the Unigram language model (Table 6.2)
performance is enhanced on all categories. The total enhancement (80% → 86%)
compared to the Baseline is due to improved ability to deal with misspellings
and splits. On four occasions the Unigram model was able to make the right
decision on 'close calls' regarding misspellings that the Baseline failed to deal
with.



Experiment           Performance categories   Recall   Precision
Unigram              Utterances               82%      76%
                     Misspellings             81%      76%
                     Run-ons                  100%     81%
                     Splits                   92%      92%
                     Total                    86%      79%

Table 6.2: Experiments with the Unigram language model



Experiment           Performance categories   Recall   Precision
Domain-Tag Bigram    Utterances               86%      79%
                     Misspellings             89%      79%
                     Run-ons                  100%     85%
                     Splits                   100%     100%
                     Total                    92%      83%

Table 6.3: Experiments with the domain-tag Bigram language model



Both the tag Bigram experiments (Tables 6.3 and 6.4) show steady
improvement over both the Baseline and the Unigram³. Between the
domain-tag Bigram and the POS Bigram, however, there is not much difference. POS
Bigram seems to have a narrow advantage with respect to precision, but the
two tag Bigram language models exhibit virtually the same results. The
advantage that POS Bigram has because of the richer class-set is possibly neutralized
by the poorer estimates resulting from the added data sparseness problem. If
the result that domain classes yield as good performance as syntactic classes
would extrapolate to a bigger corpus, we would consider this a positive result in
the context of a dialogue system since the interpretation step (input query →
SQL-query) is substantially reduced by the domain-classification of input words.

³ Due to the small test-set it is difficult to show statistically significant improvements from
one language model to another. The only difference that can be statistically confirmed is that
both tag Bigram language models are significantly better than the Baseline. This was shown
with a χ²-test on the 0.05 level using the numbers for total recall.
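
A significance check of the kind mentioned for total recall might look like
the sketch below. The raw counts are not reported in the text; here they are
reconstructed by rounding 80% and 92% of the 92 errors, which is an
assumption, as is the use of an uncorrected Pearson χ² statistic on a 2×2
table.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


ERRORS = 92                               # lexical errors in CARS
baseline_fixed = round(0.80 * ERRORS)     # reconstructed count, not from the text
bigram_fixed = round(0.92 * ERRORS)       # reconstructed count, not from the text

stat = chi_square_2x2(baseline_fixed, ERRORS - baseline_fixed,
                      bigram_fixed, ERRORS - bigram_fixed)

CRITICAL_005_DF1 = 3.841  # chi-square critical value, 0.05 level, 1 degree of freedom
significant = stat > CRITICAL_005_DF1
```

Under these assumed counts the statistic exceeds the critical value, which is
in line with the statement that the tag Bigram models are significantly better
than the Baseline.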



Experiment           Performance categories   Recall   Precision
POS Bigram           Utterances               89%      82%
                     Misspellings             89%      83%
                     Run-ons                  100%     77%
                     Splits                   100%     100%
                     Total                    92%      84%

Table 6.4: Experiments with the POS Bigram language model

CARS contains some 'impossible' lexical errors. Examples of these are:

=> s                                                          (24)

=> total-cost per mile ins of rust and                        (25)
   value-decrease ins of motor-strength

=> choose the best three with respect to                      (26)
   fuel-consumption total-cost and value-decs

Utterance (24) is a strange single character utterance. CTR suggested 'so' as
a repair, but we had decided that the subject probably meant 'show'.
Utterances (25) and (26) are both the work of one particular subject. The subject
is obviously making up new abbreviations. The two instances of 'ins' should
both be 'instead' and 'value-decs' should be 'value-decrease'. If it were
not for these four errors, the total recall performance for the Bigram models
would be around 96%.

The CARS corpus is quite small and compared to normal text standards (not 
just dialogue texts) the language in CARS is highly irregular, full of ellipses and 
other oddities. This together with the data shortage and partitioning scheme 
makes the language model parameter estimation very unreliable. On the other
hand there are virtually no real-word errors in the corpus, and this is where
a reliable language model is needed the most. The language models used here
are obviously useful in distinguishing between correction alternatives, but the 
limited vocabulary of CARS also has a positive effect on this problem since there 
are relatively few alternatives to consider. The conclusion must be that it is 
necessary to test CTR on a larger corpus, with more training material and a 
larger vocabulary. 



6.2 SECRETARY 

To acquire an error corpus, eight secretaries at the Department of Computer
and Information Science were given the task of transcribing a portion of a
software manual [IBM, 1993] written in English. The software manual
is the IBM OS/2 2.1 Installation Guide and the transcription part is an excerpt
from pages 4-18 to 4-24. The preface and chapters one, two, three and chapter
four up to page 4-18 are used as training material in the experiments reported
below and the seven page excerpt is used for testing. The paper copy of the
excerpt given to the secretaries was freed of formatting except for headings
and paragraph delimiters. The original text contains a lot of instruction lists
formatted as enumerations, for example:

1. ... 

2. Press Enter to display the Options menu. 

3. Select Set startup values and press Enter. 

4. ... 

The unformatted text (relieved of the numbers) looked quite strange to the
subjects, and most of them said after the task had been completed that it
was hard to make any sense out of the text. Another reason for this is that the
subjects were unfamiliar with the topic.

Each of the subjects was instructed to transcribe half the excerpt. Hence 
the error corpus includes four versions of the original text. The only additional 
instruction given to the subject was that "you should type as fast as you can".
The reason for this was that we wanted a larger error sample than we would 
presumably otherwise get. The subjects were not told of the purpose of the 
transcription but some of them expressed the suspicion that something along 
the line of error sampling was going on. All subjects used the correct positioning 
of the hands on the keyboard but otherwise their typewriting skills differed quite 
substantially. The time it took to transcribe the text ranged from 16 minutes 
to 41 minutes and while the most faultless typist introduced no errors at all, 
one made 39 spelling- and segmentation errors. 

It is difficult to make any systematic comparisons regarding CTR's behavior
on CARS and SECRETARY. Even if all the free variables⁴ are fixed, any direct
comparison will still only be approximate; there are obviously different errors in
the two corpora, the corpora are written in different languages and consequently
the tag-sets differ. The size of the vocabulary differs and so on.

In the experiments reported below the free variables will be kept fixed as
much as possible to facilitate some comparative study of CARS and SECRETARY,
although the primary intention with the SECRETARY experiment is to get a
better handle on the error correcting performance of CTR. We have roughly ten
times as much training material (which should give a better language model),
we have more errors and a larger vocabulary. To sum up, it is a more realistic
scenario. What CTR can do with the real-word errors is also interesting. We
have already concluded that the tag Bigram language model outperforms the
Unigram language model, even with unreliable parameter estimation, so the
experiments below will only concern the Baseline and the tag Bigram.

⁴ Free variables such as, for the OD: the network topology, error generating functions used
to produce the error corpus, smoothing of the observation distribution and filtering versus no
filtering for real-word errors. For the LD: tag-set, smoothing parameters, supervised versus
unsupervised training.

6.2.1 Error Profile 

The name SECRETARY refers to the error corpus, the four transcribed versions
of the seven page excerpt of the manual. Although this text has been typed by
eight different subjects and is a fourfold duplicate, it is regarded as one bulk
of text below.

SECRETARY contains 600 sentences and 8,938 word tokens. The sentence
level error rate is (somewhat surprisingly) almost as high as that of CARS,
whereas the word error rate is considerably lower. The figures are presented in
Table 6.5.





                       Sentences         Word Tokens
Well-formed            483    80.5%      8,788    98.3%
Lexically Ill-formed   117    19.5%        150     1.7%
Total                  600    100%       8,938    100%

Table 6.5: Error profile overview of SECRETARY



There is one important difference between CARS and SECRETARY regarding
the way in which the error profiles have been produced. In the case of CARS we
have been forced to make subjective judgments as to what was the 'intended',
or 'correct', utterance. With SECRETARY we do not have to do this since we
have the original text. We know the correct way to type the text down to the
last comma. This is of course practical, but it also makes it necessary to make
some distinctions. Based on the scenario, the application, we assume that all
differences between the transcriptions and the original text are lexical errors.
The variations that are clearly not lexical errors are not considered. Our interest
is in studying the error correcting performance of CTR.

On two occasions a whole chunk of text was omitted. The subject most 
likely looks at the text, looks up at the screen, and then back at the paper 
and continues typing from the wrong place. This sort of phenomenon is not 
considered here. Probably because of the sparse formatting in the original text, 
problems regarding sentence ending punctuation are quite frequent. Sentence 
ending punctuations were wrongfully deleted from regular sentences, and in- 
serted into subheadings. This sort of error is not considered here. Apart from 
these two exceptions, string equality is the measure used to find the errors in- 
troduced into the transcribed text. Because of the nature of the task presented 
to the subjects, all cases of substitutions of one word for another are considered 
real-word errors, even if they are really agreement errors. 

Table 6.6 shows how the lexical errors are distributed in SECRETARY. The
'easy' errors, the nonword single error misspellings, make up a relatively large
portion of the errors in SECRETARY compared to CARS. Overall, the figures in
Table 6.6 are more in line with what others have found; there are more
real-word errors, fewer segmentation errors and there are more 'easy' errors. CTR's
performance on real-word errors was not really put to the test in the CARS
experiments, but in SECRETARY there is a sample to test the recovery abilities
of CTR on this error type.

                            Nonword error    Real-word error    Total
Missp.    Single error      109    85.2%     19    86.4%        128    85.3%
          Multiple error      5     3.9%      2     9.1%          7     4.7%
Run-ons   Single error       14    10.9%      0       0%         14     9.3%
          Multiple error      0       0%      0       0%          0       0%
Splits    Single error        0       0%      1     4.5%          1     0.7%
          Multiple error      0       0%      0       0%          0       0%
Total     Single error      123    96.1%     20    90.9%        143    95.3%
          Multiple error      5     3.9%      2     9.1%          7     4.7%
          Total             128     100%     22     100%        150     100%

Table 6.6: Breakdown of lexical errors in SECRETARY

There is a relatively large portion of really hard real-word errors in
SECRETARY. Eight out of 22 real-word errors would be impossible even for a human
proof-reader to detect. An example:

SEC: If you select Yes for Timer, indicate how long you want      (27)
     the menu displayed before the default operating system is
     started.

The proper way to type this sentence, according to the original text, would be to
substitute 'selected' for 'select'. This is, of course, a very harsh correctness
criterion, but the task given to the subject was to transcribe the text, not to
convey the general meaning of the text. There are examples of real-word errors
that are not impossible to detect, but still very hard to handle:

SEC: Specifying Options for the OS/2 2.1 Partition of Logical     (28)
     Drive

'of' in sentence (28) should be 'or'.

The sort of sloppiness that was found in CARS (see utterance (18) in
Section 3.1) is not present in SECRETARY in the same way. This shows in the lower
multiple error rate of SECRETARY. Table 6.7 displays the low multiple error
rate and how the singletons are distributed over the basic error types: deletion,
insertion, substitution and transposition. The ranking order of the basic error
types is the same as that found by Pollock and Zamora [1983] in their sample
of 50,000 nonword misspellings.

                          Single Errors              Multiple
                Total    Del     Ins     Sub     Tra   Errors
Misspellings      135     48      30      29      21       7
Run-ons            14     14
Splits              1              1
Total             150   41.3%   20.7%   19.3%   14.0%    4.7%

Table 6.7: Comparison of the basic error types in secretary

It is not easy to say which corpus, CARS or secretary, is the more demanding
from the point of view of error recovery. secretary has more single error
nonword misspellings and fewer segmentation errors, but then, judging from the
results with the CARS experiment, segmentation errors do not seem to be much
of a problem for CTR. CARS has an 18% multiple error rate while secretary
only has 4.7% and this indicates that CARS is more difficult. On the other hand
secretary has a 14.7% real-word error rate compared to 5.5% for CARS and
real-word errors are clearly the most difficult error type.
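
The percentages in the bottom row of Table 6.7 can be recomputed from the counts. The sketch below is only an arithmetic check, not part of CTR. It assumes that the 14 run-ons are counted in the deletion column and the single split in the insertion column, which is the reading that reproduces the printed figures; the function and variable names are illustrative.

```python
# Counts read from Table 6.7. Placing run-ons under Del and the split under Ins
# is an assumption; it is the reading that reproduces the printed percentages.
counts = {
    "Misspellings": {"Del": 48, "Ins": 30, "Sub": 29, "Tra": 21, "Multiple": 7},
    "Run-ons": {"Del": 14, "Ins": 0, "Sub": 0, "Tra": 0, "Multiple": 0},
    "Splits": {"Del": 0, "Ins": 1, "Sub": 0, "Tra": 0, "Multiple": 0},
}


def column_shares(table):
    """Return the grand total, the per-column totals and the per-column shares in percent."""
    columns = ["Del", "Ins", "Sub", "Tra", "Multiple"]
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    grand_total = sum(totals.values())
    shares = {c: round(100 * totals[c] / grand_total, 1) for c in columns}
    return grand_total, totals, shares


grand_total, totals, shares = column_shares(counts)
print(grand_total, shares)

# The real-word error rate quoted in the text: 22 real-word errors out of all errors.
real_word_rate = round(100 * 22 / grand_total, 1)
print(real_word_rate)
```

Under this reading the column shares and the 14.7% real-word error rate quoted above both follow from a total of 150 errors.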



6.2.2 The Orthographic Decoder 



The training and test material taken together contain 1,223 word types so the 
Orthographic Decoder contains 1,223 word modeling HMMs. There are only 17 
words in the test corpus that are not found in the training corpus. In accordance 
with previous experiments these word models are included in the OD. 

The OD setup reported on in the CARS experiment has been evaluated here 
as well. However, the possible variation in the OD setup has been somewhat 
more systematically evaluated in the secretary experiment. We have tried 
both filtered and unfiltered error corpora, and three different values for
$\epsilon_{obs}$ have been tried: $10^{-4}$, $10^{-6}$ and $10^{-8}$. Together this makes six
alternative OD setups. The results reported below concern the same setup that
was used for CARS. The effect of the other setups is discussed in Section 6.2.4.

In the experiments reported in Section 6.2.4, the error corpora were generated
with the error functions deletion, substitution, and white space insertion.
There are nine single-character 'punctuation words' in the vocabulary, and the
training material for them is a small set of hand-made corruptions that only
involve the space character. There are 49 numbers in the vocabulary. These
special words have corpora generated with only the white space insertion error
function. One set of ODs had their error corpus filtered for real-word errors
and the rest were trained on unfiltered corpora. The OD HMMs were trained with
the Baum-Welch reestimation algorithm. After training, the models had their
observation symbol distributions smoothed with the additive smoothing scheme
with $\epsilon_{obs} = 10^{-4}$, $\epsilon_{obs} = 10^{-6}$ and $\epsilon_{obs} = 10^{-8}$.
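
The effect of the smoothing constant can be illustrated with a small sketch. The exact additive scheme used in CTR is defined elsewhere in the thesis and may differ in detail; the version below is one common form (add a constant to every probability and renormalize), and the distribution and the name `additive_smoothing` are made up for the illustration.

```python
def additive_smoothing(distribution, epsilon):
    """Add epsilon to every probability and renormalize so that the result sums to 1.

    This is one common form of additive smoothing, used here as an assumption;
    it is not necessarily the exact scheme used in CTR.
    """
    total = sum(distribution.values()) + epsilon * len(distribution)
    return {symbol: (p + epsilon) / total for symbol, p in distribution.items()}


# A made-up observation distribution in which one symbol was never seen in training.
trained = {"a": 0.7, "b": 0.3, "c": 0.0}

for epsilon in (1e-4, 1e-6, 1e-8):
    smoothed = additive_smoothing(trained, epsilon)
    # The unseen symbol receives a small non-zero probability that shrinks with epsilon.
    print(epsilon, smoothed["c"])
```

A larger constant moves more probability mass to unseen symbols, so the choice of value trades tolerance for unexpected characters against trust in the trained distribution.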
6.2.3 The POS Bigram Language Model

The POS Bigram language model includes 50 tags. The tag-set consists of
traditional Part-of-Speech tags with verbs and nouns subcategorized for
morphological features. Some frequent words with supposedly uniform contextual
distributions have been given their own tag. The complete tag-set is listed in
Appendix A.3.
The training corpus consists of 19,975 word tokens. A tagged and an untagged 
version of it have been used for estimating the model parameters. 

Different Linguistic Decoder setups have been tried. We have used one
smoothed with $\epsilon_{trans} = 10^{-3}$ and $\epsilon_{obs} = 10^{-3}$ and one that was smoothed
with $\epsilon_{trans} = 10^{-4}$ and $\epsilon_{obs} = 10^{-4}$. Note that there is no way to analytically
determine the best smoothing value. Smoothing of unreliable distributions is an 
important research topic, and to get a good estimation of the parameter space it 
is necessary to use more advanced methods than we are using here. However, we 
are content to see that the techniques presented here work satisfactorily with a 
not so good smoothing scheme, reassured that with a better smoothing method 
things can only get better. 

In the secretary experiments we are not that interested in the tag sequence 
output from CTR, rather we would like to maximize the predictive power of the 
(weak) language model. The LD trained with the tagged training material has
been contrasted with the LD estimated from the untagged text using the
Baum-Welch reestimation algorithm. In Chapter 5 it was mentioned that the model
trained with the Baum-Welch algorithm will converge to a local maximum.
Which of the optima the model will converge towards depends (amongst other 
things) on the distributions of the initial model. In the experiments reported 
here the initial model has a uniform transition distribution and an observation 
distribution that has uniformly distributed probabilities for the words that are 
members of the tags (which are represented by the states of the model). In 
other words: the initial model 'knows' which words belong to which tag and 
nothing else. 
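
The initial model described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the toy lexicon, the dictionary representation and the name `initial_model` are all hypothetical.

```python
def initial_model(tag_lexicon):
    """Build the initial distributions for Baum-Welch training.

    tag_lexicon maps each tag (one HMM state per tag) to the words that may carry it.
    """
    tags = sorted(tag_lexicon)
    vocabulary = sorted({word for words in tag_lexicon.values() for word in words})

    # Uniform transition distribution: every tag is equally likely to follow every tag.
    transitions = {src: {dst: 1.0 / len(tags) for dst in tags} for src in tags}

    # Observation distribution: uniform over the words that are members of the tag,
    # zero for all other words. The model 'knows' tag membership and nothing else.
    observations = {}
    for tag in tags:
        members = set(tag_lexicon[tag])
        observations[tag] = {
            word: (1.0 / len(members) if word in members else 0.0)
            for word in vocabulary
        }
    return transitions, observations


# Hypothetical toy lexicon; 'both' is two ways ambiguous.
toy_lexicon = {"DT": ["both", "all"], "KN": ["both", "either"], "NN": ["backup"]}
transitions, observations = initial_model(toy_lexicon)
print(transitions["DT"])
print(observations["DT"])
```

Because the zero entries in the observation distribution stay zero under reestimation, the tag membership encoded in the initial model restricts which optimum the training can reach.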

Together with the two smoothing setups, the supervised and unsupervised 5
training methods yield four different LD configurations. The results on the
POS Bigram Linguistic Decoder reported in Section 6.2.4 refer to the LD that
has been trained unsupervised and has been smoothed with $\epsilon_{trans} = 10^{-4}$ and
$\epsilon_{obs} = 10^{-4}$. The most ambiguous word in the language model is four ways
ambiguous. In the experiments reported below CTR has been evaluated using 
both 1-best and 4-best search. The outcomes of these different search strategies 
were, however, identical. 



5 Supervised and unsupervised are not very precise terms. The supervision that is provided 
for the LD that is extracted from the tagged corpus applies to the proper tagging of words, 
not to the prediction of the next word, which is what we are interested in. The terms are used 
here because of their intuitive appeal. 
6.2.4 Results

The evaluation scheme used here is the same as the one used for CARS except
that nonword and real-word errors have their own keys. Note that nonword and
real-word errors on the one hand and misspellings, run-ons and splits on the
other hand are orthogonal, i.e. $A_{tot} = A_{non} \cup A_{real} = A_{m} \cup A_{r} \cup A_{s}$.



Experiment   Performance category   Recall   Precision
Baseline     Sentences              51.3%    51.3%
             Misspellings           54.1%    54.1%
             Run-ons                100%     100%
             Splits                 0%       0%
             Nonwords               68%      68%
             Real-words             0%       0%
             Total                  58%      58%

Table 6.8: Baseline experiment



The higher smoothing value worked best in the Baseline experiment, i.e.
$\epsilon_{obs} = 10^{-4}$ outperformed $\epsilon_{obs} = 10^{-6}$ and $\epsilon_{obs} = 10^{-8}$. The filtered and the
unfiltered versions produced identical results. No errors were introduced.

There is just one split in the corpus. The split is also a real-word error and 
it reads: 



SEC: If you hav e a Dual Boot partition containing ... (29)

'e' is a valid word in the vocabulary; it actually has three meanings: the name
of an appendix (in the manual), the name of a disk-partition and the name of
a logical drive. Sentence (29) was changed into

CTR: if you have e a dual boot partition containing ... (30)



which is obviously not the desired output. (Recall that there is no discrimination 
made between upper- and lowercase characters.) 

There is a considerable difference between the result of the CARS Baseline 
experiment and that of secretary. Since no LD is involved and since the 
OD HMMs are trained and smoothed in the same way, the only thing that can 
explain the difference is the larger number of real-word errors in secretary 
and the increased vocabulary size. 

When CTR is supplied with the POS Bigram LD, there is a considerable
boost in performance 6. The OD used in the experiment in Table 6.9 is that
which has been filtered for real-word errors.

Experiment   Performance category   Recall   Precision
POS Bigram   Sentences              78.6%    78.6%
             Misspellings           80%      79.4%
             Run-ons                100%     100%
             Splits                 100%     100%
             Nonwords               93%      93%
             Real-words             18.2%    17.4%
             Total                  82%      81.5%

Table 6.9: The unsupervised POS Bigram experiment smoothed with
$\epsilon_{trans} = 10^{-4}$ and $\epsilon_{obs} = 10^{-4}$

The fact that the LD smoothed with the lower value outperforms the one
with the higher smoothing value indicates that the parameter values arrived
at by the Baum-Welch algorithm are not such a bad estimation. The filtered
and the unfiltered version give virtually the same results. On one occasion
the unfiltered version succeeded in correcting a real-word error that the filtered
version failed to correct. However, the unfiltered version also introduced a couple
of errors so the net return of the filtered version is slightly better.

6 A $\chi^2$-test showed significance compared to Baseline at the 0.001 level.

The real-word errors that CTR can handle are those where there is an
orthographic similarity between the error and the proposed normalization, and
the real-word error is part of an unlikely tag sequence. An example of a
real-word error that CTR successfully transformed is:

SEC: Of you select advanced, your Boot Manager (31)

CTR normalized the sentence to

CTR: if you select advanced, your boot manager (32)



which was the desired output. 

The unsupervised LD performs better than the supervised. The optimum 
reached under the restrictions imposed by the initial model with the Baum- 
Welch algorithm is a minimum entropy point. The fact that this model outper- 
forms the model with the higher entropy is by no means surprising. 

The nonword error correcting rate of secretary is the same as the total 
correction rate of CARS (which basically only contains nonword errors). One 
can speculate that the higher degree of multiple errors in CARS is compensated 
for by the larger vocabulary in secretary. 

A closer look at the errors that CTR failed to properly correct revealed that
a disproportionately large share of the problematic errors were transpositions.
Since the transposition error function was excluded from the functions that
generated the training corpora for the various ODs, we ran a series of
complementary experiments where every parameter was held constant except that
the transposition error function was included in the generation of the corpora.
The result is shown in Table 6.10.
Experiment   Performance category   Recall   Precision
POS Bigram   Sentences              81.2%    81.2%
             Misspellings           83%      82.4%
             Run-ons                100%     100%
             Splits                 100%     50%
             Nonwords               94.5%    94.5%
             Real-words             27.3%    25%
             Total                  84.7%    83.6%

Table 6.10: The unsupervised POS Bigram experiment smoothed with
$\epsilon_{trans} = 10^{-4}$ and $\epsilon_{obs} = 10^{-4}$ where the OD has been trained on transpositions



6.3 Discussion 



Clearly the CTR system can be used to normalize text input to the vocabulary 
and language of a limited domain. The number of utterances affected by lexical
errors in the dialogue scenario is brought down from 71 (19.2%) to 13 (3.5%)
by CTR (POS Bigram). In the transcription scenario, the error rate is brought
down from 117 (19.5%) lexically ill-formed sentences to 22 (3.7%) (from the 
experiment in Table 6.10). Counting only the nonword errors, there are only 7 
(1.2%) output sentences that diverge from the key. 

The results reported in the previous sections validate the assertions made
in Section 3.3. The wide error scope is obviously beneficial. Particularly the
dialogue scenario with its many segmentation errors would be severely crippled
without CTR's ability to handle run-ons and splits. The total recall would fall
from 92% to 59% in both the CARS Bigram experiments if neither real-word 
errors nor segmentation errors could be fixed. The advantage derived from 
modeling of the local context is obvious when comparing the experiments in the 
two scenarios to their respective Baselines. 

A rather ill-chosen, but still interesting, comparison can be made between
the performance of CTR and commercially available text processing tools. We
ran the test data through the spell-checker used in Microsoft Word for
Windows 95 7. Before testing, the spell-checker was given the complete secretary
vocabulary. We took the highest ranked correction candidate from the
spell-checker to be the suggested correction. It is unfair to compare CTR to the
WORD spell-checker for two reasons. Firstly, the spell-checker is not designed
to be an automatic spelling corrector; its primary task is to detect errors and
bring the user's attention to them. Secondly, the spell-checker has a vocabulary
that is considerably larger than CTR's. There are more correction candidates to
consider, and on ten occasions it turned out that errors that were nonwords
relative to CTR's vocabulary were missed because they were valid words in WORD's
vocabulary. The results from the experiment with the WORD spell-checker are
displayed in Table 6.11. The spell-checker works fine with the nonword
misspellings but is incapable of handling any of the segmentation and real-word
errors. On 20 occasions the spell-checker stopped on something that was not in
error, and was unable to suggest a correction. These problems most certainly
arise from faulty assumptions made by the spell-checker's tokenizer. It will
highlight items like 'Ctrl+Alt+Del', assuming that it is one word. It is simply
the case that '+' does not delimit tokens in WORD. Nevertheless, it actually
performs slightly better than the Baseline (without transposition-training) on
the misspelling error category.

Experiment   Performance category   Recall   Precision
MS Word      Sentences              49.6%    42.3%
             Misspellings           57%      57%
             Run-ons                0%       0%
             Splits                 0%       0%
             Nonwords               59.4%    59.4%
             Real-words             0%       0%
             Total                  51.3%    45.3%

Table 6.11: Microsoft Word experiment

7 The spell-checker used in WORD is International Correct Spell from INSO Corporation.

It was pointed out above that there is virtually no other research effort taking 
the holistic approach presented here, addressing the entire problem area in a 
unified framework that uses both a model of language production and one for 
typing behaviour and which makes tokenization part of the recovery process. 
So the results presented here do not lend themselves easily to comparisons to
what others have done. (Another major problem is, of course, that people use
different test sets.) However, a couple of notes can be made.

As mentioned above, Kukich [1992b] made a comparative study of some
of the more well-known approaches to isolated-word spelling correction on 170
human-generated nonword misspellings with vocabularies of three different sizes.
Two of the vocabularies, 521 words and 1,142 words, are quite close in size to the
vocabularies used in CARS and secretary (584 and 1,223 words respectively).
The OD component of CTR compares quite favorably to the best isolated-word
spelling correction techniques. On the smaller vocabulary Kukich reports 81%
accuracy for the best technique which is just about the same as for the CARS
Baseline. The secretary Baseline (without transposition-training) result on
the nonwords is 68% and the result from the best isolated-word spelling
correction technique is 78% 8. Note that the isolated-word spelling correction
programs do not need to tokenize the input; they are given one word at a time,
whereas CTR needs to find the word boundaries by itself. The conspicuous fall in
accuracy when the size of the vocabulary grows gives yet another clear
indication of the positive impact of small vocabularies.

8 A technique called SVD (Singular Value Decomposition) worked best on the smaller
vocabulary. The accuracy for the different programs ranged from 64% to 81%. The technique
of Kernighan-Church-Gale [Church and Gale, 1991b, Kernighan et al., 1990] worked best on
the larger vocabulary. The accuracy for the different programs ranged from 54% to 78%.

CTR does not compare well to other automatic spelling correction techniques
in terms of processing efficiency. The main reason for the difference in speed
is that CTR does not make any hard assumptions regarding the location of word
boundaries. However, CTR has the feature of character incremental processing.
In the dialogue application, which was the first intended usage, this means
that CTR performs real-time text recognition, i.e. it processes the input as
fast as the user can type. With the 1,223 word vocabulary and without Beam
Search pruning CTR processes the input (on a SUN Sparcstation 5) with
approximately the speed with which a skilled typist would type it.

The results from these first two experiments are certainly promising, all the 
more so since there is, on several accounts, obviously room for improvement. 

The transposition error type is a weak spot in the Orthographic Decoder.
The topology of the word models prevents transposition errors from being
processed as transposition errors. The lack of back-chaining transitions in the
HMM means that a word containing a transposition error will be 'diagnosed' as
having a deletion immediately followed by an insertion. This topology-related
problem cannot be completely trained away. The OD trained on transpositions
(Table 6.10) could correct three transpositions and one deletion that the other
OD could not handle. (Two out of these four were also real-word errors.) This
shows that training certainly pays off, but still, the restrictions imposed by the
topology of the model make matters more difficult. There were three errors
that the WORD spell-checker managed to fix that CTR failed on. These three
were all transposition errors: 'of' had been spelled 'fo' (on three occasions).
Giving up the left-to-right model topology may very well improve the system's
ability to deal with this error type.
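
The cost of lacking a transposition operation can be illustrated with plain edit distances. The sketch below is only an analogy and does not model the OD HMMs: without a transposition operation, 'fo' is two basic edits away from 'of', whereas a distance that admits adjacent transpositions (the optimal string alignment variant of Damerau-Levenshtein) counts it as one. The function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance with deletion, insertion and substitution only."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def osa_distance(a, b):
    """Optimal string alignment distance: as above, plus adjacent transposition."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Two adjacent characters swapped: count the swap as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(levenshtein("fo", "of"), osa_distance("fo", "of"))
```

In the same way, a left-to-right word model must pay for two separate error events where the typist made one, which makes competing words look relatively more attractive.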

Another important aspect concerning the OD is, of course, the way in which
it is trained. The only information available to the system regarding the cause
of an error is the keyboard layout and even this information is rather sparse 9.
The generation of the error corpora for the OD HMMs is entirely based on
lexical errors that are accidental by nature. More knowledge can be supplied
to the system by studying naturally occurring errors, especially those caused by
cognitive and/or phonetic misconceptions. Content words and function words
should probably not be treated alike in this respect. People (at least adults) 
usually know how to spell the function words and consequently the errors that 
affect this category are more likely to be mistakes. 

The Linguistic Decoder is the most interesting component of the system from 
the improvement point of view. It has been shown [Gale and Church, 1990] 
that the additive smoothing scheme is inferior to a number of alternative, more 
sophisticated smoothing methods. We know that a better smoothing scheme 
will improve the LD in CTR, the question is just how big the improvement will 
be. 



9 Grudin [1983] has reported on a comprehensive study of errors produced by typists of 
different skill-levels. The fact that the keypads are placed in close proximity to each other 
can just partly explain all the things that can go wrong. 
The real-word error category is obviously the most problematic. CTR's ability
to correct these errors hinges on the predictive power of the Linguistic Decoder.
With sparse data it is necessary to choose a model with fewer parameters, such
as a tag Bigram language model for example. Such a model has only rather
vague ideas of what is likely to appear next in a given context. If a real-word
error happens to belong to the same tag as the intended word, the tag Bigram
is unable to detect the error. To improve the real-word error correction rate it
is necessary to deploy a more powerful language model, a language model with
lower (cross-word) entropy. Mays et al. [1991] used the trigram language model
employed in the IBM speech recognition project [Bahl et al., 1983] to correct
single error real-word errors. They managed to detect and correct 73% of the
real-word errors, and managed this with a rather primitive model of the channel
characteristics. These are reassuring results.
Future Work



7.1 Practical Issues 

There are a number of issues that need to be addressed if CTR is to be placed
in the hands of an actual user.

One of CTR's features is the incremental processing. The process is monotonic,
however, so if the user goes back and edits the input string, it will cause
difficulties. Special care must be taken of backspacing in the input.

CTR's vocabulary (the OD HMMs) has a high memory space demand. An
application with tens of thousands of word-forms most likely requires some
kind of morphological processing; having a network for each word-form is not
a practicable path. Word-groups (as described above) may marginally reduce
the problem, but prefixes, suffixes and inflectional forms probably need to be
dealt with in a more systematic fashion. It may prove profitable to look into
Two-level Morphology [Koskenniemi, 1983].

CTR does not discriminate between upper- and lowercase characters which 
means that information is lost in the processing of the input. This needs to be 
rectified should the system be used in a larger-scale application.

One way to look at CTR is to think of it as an intelligent tokenizer. The 
system spends much effort calculating the most likely segmentation of the
input. As the system presently runs it spends a lot of time with this even if 
the input is error-free and tokenization is trivial. There is however a simple 
way around this waste of effort. The OD could easily be supplied with a more 
primitive (standard) complementary tokenizer. This tokenizer would then run 
as long as there is no problem, and when something goes wrong the OD would 
take over. The LD would run as before, only it would receive one token (as 
in Token Passing) per word-token instead of a number of tokens per character. 
Note that with this technique real-word errors would have to be spotted from 
the LD without any help from the OD. 
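
The division of labour suggested above can be sketched in a few lines. This is a deliberately simplified illustration, not a design taken from CTR: it falls back on the whole input rather than on the problematic region, and `hybrid_tokenize`, `stand_in_recovery` and the toy vocabulary are hypothetical names standing in for the real components.

```python
def hybrid_tokenize(text, vocabulary, recover):
    """Use a primitive whitespace tokenizer as long as nothing looks wrong.

    If every token is a known word, the cheap tokenization is accepted.
    Otherwise the input is handed to `recover`, which stands in for the OD.
    """
    tokens = text.lower().split()
    if tokens and all(token in vocabulary for token in tokens):
        return tokens
    return recover(text)


def stand_in_recovery(text):
    """Placeholder for the Orthographic Decoder; it only marks the hand-over."""
    return ["<handed to OD>", text]


toy_vocabulary = {"if", "of", "you", "select", "advanced"}

print(hybrid_tokenize("If you select advanced", toy_vocabulary, stand_in_recovery))
print(hybrid_tokenize("Ifyou select advanced", toy_vocabulary, stand_in_recovery))
# A real-word error passes the cheap tokenizer unnoticed and is left to the LD.
print(hybrid_tokenize("Of you select advanced", toy_vocabulary, stand_in_recovery))
```

The last call shows the consequence noted in the text: since 'of' is a valid word, the fast path accepts it, and only the language model can flag it.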
7.2 The Unknown Word

There is no general solution to the unknown word problem. Under certain 
circumstances however there may be ways to limit the problem somewhat. A 
short word generally has a lot of neighbors, words that are, say, one edit distance 
away from it. With a longer word, say ten characters, there are usually just a 
few neighbors. So if a long word is correctly spelled and a moderately chosen 
beam is used, there will be a relatively small number of viable hypotheses inside 
the beam. If the long word is misspelled (not too severely), there will probably
still be a controllable number of hypotheses inside the beam. However, if a 
previously unknown word is typed in, there is a good chance that the best 
hypothesis is quite distant from the optimal hypothesis (which is unknown). 
Since the best hypothesis controls the beam and the best hypothesis is bad, the 
beam will in effect be wider than normal and more distant neighbors will fit 
inside the beam. Thus, if a long word is typed into the system and the number
of viable hypotheses inside the beam exceeds a threshold, the word is probably
unknown. This hypothesis should be pursued. (Note that shorter words are usually
not unknown.)
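
The proposed test can be written down as a small sketch. It is untested as a method and every constant in it is an assumption: the minimum word length, the beam width, the hypothesis threshold and the use of log-probability scores are chosen only for illustration, and `looks_unknown` is a hypothetical name.

```python
def looks_unknown(typed_word, hypothesis_scores, beam_width,
                  min_length=10, max_hypotheses=5):
    """Flag a long word as possibly unknown when too many hypotheses fit in the beam.

    hypothesis_scores holds one log-probability per competing word hypothesis.
    The beam is measured from the best score, so a poor best hypothesis lets
    more distant neighbors inside. All thresholds are illustrative assumptions.
    """
    if len(typed_word) < min_length or not hypothesis_scores:
        return False  # short words have many neighbors anyway and are rarely unknown
    best = max(hypothesis_scores)
    inside_beam = [score for score in hypothesis_scores if score >= best - beam_width]
    return len(inside_beam) > max_hypotheses


# A known long word: one hypothesis clearly dominates.
print(looks_unknown("partitioning", [-2.0, -30.0, -31.0], beam_width=10.0, max_hypotheses=2))
# A possibly unknown long word: the best hypothesis is poor, so many others stay close to it.
print(looks_unknown("xylophonists", [-40.0, -41.0, -42.5, -43.0], beam_width=10.0, max_hypotheses=2))
```

Whether such a threshold separates unknown words from badly misspelled known words in practice is exactly the open question raised above.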

7.3 Exploiting the Potential of the Framework 

It was pointed out above that there are no restrictions in the Token Passing 
framework as such as to how many layers there might be. The layers can imple- 
ment different probabilistic networks, actually they need not even be probabilis- 
tic. The only requirement is that a layer can communicate with its neighbors 
in a meaningful way. Hidden Markov Models are appealing since they have 
one distribution (the transitions) for network internal operations and one (the 
observables) for communication. 

Connected Text Recognition with layered HMM networks in the Token Passing
framework is quite a machinery. It might even seem like a bit of an overkill to
use this rather complex system just to tokenize an input string and output the
normalization of it. However, the system leaves room for more complex tasks
to be performed as well. We stated above that CTR could be told to output the
tag sequence along with the normalized utterance. An utterance like:

==> show volvo with impact-safety higher than 3 (33)

would then be processed and output as 1:

CTR: show/CH volvo/OH with/R impact-safety/AH higher/VH (34)
than/R 3/VH

The portion 'with impact-safety higher than 3' is obviously a restriction
that the subject wants to have placed on the Volvos that he wants extracted
from the database. It would clearly be a great help if the system could identify
such phrases. Envisage a third layer, in between the word modeling Orthographic
Decoder and the utterance modeling HMM in the Linguistic Decoder, that models
phrases of the type just exemplified. The additional layer would make the
LD two-layered. This layer would consist of a number of Token Passing networks
(e.g. HMMs) that would segment the stream of words from beneath into
phrases, phrases with a meaning in the application at hand. One network in the
new layer would then be the conditional network. The output from CTR could
now look like:

CTR: show/CH volvo/OH [with/R impact-safety/AH (35)
higher/VH than/R 3/VH]/COND

1 Cf. Appendix A.1 for an explanation of the tags.

One can also imagine a fourth layer (making the LD three-layered) that classifies
utterances in terms of their dialogue function (see e.g. Jonsson [1993]). This
fourth layer would be the new topmost layer and the output could look like:

CTR: [show/CH volvo/OH [with/R impact-safety/AH higher/VH (36)
than/R 3/VH]/COND]/EXTRACT

The step from (36) to the SQL-query below is not very long to take.

select manufacturer, model, year, impact-safety from CARS where
model = 'volvo' and
impact-safety > 3

The three fields 'manufacturer', 'model' and 'year' together make up the
unique identifier for each car in the database. These fields are listed in the output
by default while the other fields referred to in the question are appended.
Appendix A

Tag-sets Used in CTR

A.1 The Domain-Tags used in CARS

• 19 tags 

• 584 words 

• 55 words are ambiguous 

— 51 words are two ways ambiguous 

— 4 words are three ways ambiguous 



Tag   Explanation                Example
AH    Aspect Head                'fuel-consumption'
CC    Coordinating Conjunction   'and'
CH    Communicative Head         'find'
CM    Communicative Modifier     'help'
K     Non-sentence Delimiters    ','
M     Determiner                 'each'
N     Numeral                    '4'
OH    Object Head                'mazda 323'
OW    Object Wh-word             'which'
P     Sentence Delimiters        '?'
R     Relation Word              'instead'
RS    Response Word              'yes'
SH    Semiotic Head              'mean'
SM    Semiotic Modifier          'not'
TH    Table Head                 'table'
VH    Value Head                 '0,9'
VM    Value Modifier             'about'
VW    Value Wh-word              'how'
X     Others                     'thanks'

A.2 The POS used in CARS

• 31 tags 

• 584 words 

• 31 words are ambiguous 

— 30 words are two ways ambiguous 

— 1 word is three ways ambiguous 



Tag   Explanation                 Example
AB    Adverb                      'quickly'
CIT   Citation Mark               '"'
COM   Comma                       ','
DSH   Dash                        '-'
DT    Determiner                  'all'
HA    Wh Adverb                   'why'
HD    Wh Determiner               'which'
HP    Wh Pronoun                  'what'
IM    Infinitive Marker           'to'
IN    Interjection                'ok'
JJ    Adjective                   'lower'
KN    Coordinating Conjunction    'both'
LP    Left Parenthesis            '('
NN    Noun                        'car'
PC    Participle                  'enumerated'
PM    Proper Noun                 'audi 100'
PN    Pronoun                     'these'
PNO   Object Pronoun              'them'
PNS   Subject Pronoun             'you'
PP    Preposition                 'with'
PRT   Particle                    'away'
PS    Possessive Pronoun          'their'
QUE   Question Mark               '?'
REL   Relative Marker             'which'
RG    Number                      '1988'
RP    Right Parenthesis           ')'
SN    Subordinating Conjunction   'if'
VBF   Finite Verb                 'classified'
VBI   Verb Infinitive             'see'
VBP   Verb Imperative             'add'
VBS   Supine Verb                 'shown'

A.3 The POS used in secretary

• 50 tags 

• 1223 words 

• 78 words are ambiguous 

— 74 words are two ways ambiguous 

— 3 words are three ways ambiguous 

— 1 word is four ways ambiguous 



Tag   Explanation                       Example
AB    Adverb                            'actually'
ACL   Article                           'a'
AND   -                                 'and'
ARE   -                                 'are'
BE    -                                 'be'
BSL   Backslash                         '\'
COL   Colon                             ':'
COM   Comma                             ','
CT    Citation Mark                     '"'
DO    -                                 'do'
DSH   Dash                              '-'
DT    Determiner                        'both'
DZ    -                                 'does'
HA    Wh Adverb                         'how'
HD    Wh Determiner                     'which'
HP    Wh Pronoun                        'what'
IJ    Interjection                      'please'
IM    Infinitive Marker                 'to'
IS    -                                 'is'
IT    -                                 'it'
JJ    Adjective                         'available'
KN    Coordinating Conjunction          'either'
LB    Left Parenthesis                  '('
MD    Modal Auxiliary                   'should'
NEG   Negation                          'not'
NN    Noun                              'backup'
NNS   Plural Noun                       'actions'
OF    -                                 'of'
PKT   Period                            '.'
PM    Proper Noun                       'autoexec.bat'
PN    Pronoun                           'some'
PNO   Object Pronoun                    'them'
PNS   Subject Pronoun                   'they'
PP    Preposition                       'by'
PS    Possessive Pronoun                'its'
QUE   Question Mark                     '?'
RB    Right Parenthesis                 ')'
REL   Relative Marker                   'that'
RG    Number                            '1024'
RO    Ordinal Number                    'first'
SN    Subordinating Conjunction         'after'
SYM   Symbol                            '%'
THE   -                                 'the'
UO    Unknown                           'md'
VB    Bare Verb Form                    'change'
VBD   Verb Past Tense                   'selected'
VBG   Verb ing-form                     'according'
VBL   Past Participle                   'accessed'
VBZ   Verb Third Person Present Tense   'installs'
YOU   -                                 'you'



